/home/aiscuser/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.24.4)
warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
2023/07/19 14:48:47 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of transformers. If you encounter errors during autologging, try upgrading / downgrading transformers to a supported version, or try upgrading MLflow.
2023/07/19 14:48:48 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.
2023/07/19 14:48:48 INFO mlflow.tracking.fluent: Autologging successfully enabled for transformers.
Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging results (for instance --report_to none).
Downloading and preparing dataset glue/rte to /home/aiscuser/.cache/huggingface/datasets/glue/rte/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353...
Downloading data: 100%|██████████| 697k/697k [00:00<00:00, 4.28MB/s]
Generating train split: 2244 examples [00:00, 22359.10 examples/s]
Generating validation split: 0 examples [00:00, ? examples/s]
Generating test split: 2823 examples [00:00, 28163.62 examples/s]
Dataset glue downloaded and prepared to /home/aiscuser/.cache/huggingface/datasets/glue/rte/1.0.0/a420f5e518f42454003587c47467370329f9fc0c6508d1ae0c45b58ea266a353. Subsequent calls will reuse this data.
100%|██████████| 3/3 [00:00<00:00, 787.91it/s]
disable token pruning.
enable token pruning. token_prune_loc: [3, 4, 5, 6, 7, 8, 9, 10, 11]
NOTICE: THIS IS PRUNING STAGE
max_seq_length: 256
Running tokenizer on dataset: 100%|██████████| 2490/2490 [00:00<00:00, 3272.83 examples/s]
Running tokenizer on dataset: 0%|          | 0/277 [00:00<?, ? examples/s]
Running tokenizer on dataset: 100%|██████████| 3000/3000 [00:00<00:00, 3200.89 examples/s]
Downloading builder script: 5.76kB [00:00, 5.98MB/s]
double check the prune location is loaded correctly: [3, 4, 5, 6, 7, 8, 9, 10, 11]
double check hard_token_mask: <class 'NoneType'>
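
(token_prune_loc above lists the encoder layers, 3 through 11 of the 12-layer backbone, where token pruning is active; hard_token_mask is None because no precomputed mask is loaded at this stage. A minimal sketch of what per-layer gating of this shape can look like; TokenGate and GatedEncoderLayer are hypothetical names for illustration, not the repo's actual classes:)

import torch
import torch.nn as nn

class TokenGate(nn.Module):
    """Illustrative gate: score each token, keep those above a threshold.
    Hypothetical stand-in for the repo's learned pruning module."""
    def __init__(self, hidden_size: int = 768, threshold: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)
        self.threshold = threshold

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        scores = torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)
        return (scores > self.threshold).float()   # (batch, seq_len) keep mask

class GatedEncoderLayer(nn.Module):
    """Wraps one encoder layer; only layers listed in token_prune_loc get a gate."""
    def __init__(self, layer: nn.Module, layer_idx: int,
                 prune_loc=(3, 4, 5, 6, 7, 8, 9, 10, 11)):
        super().__init__()
        self.layer = layer
        self.gate = TokenGate() if layer_idx in prune_loc else None

    def forward(self, hidden_states, attention_mask):
        if self.gate is not None:
            # Pruned tokens are masked out of attention from this layer onward.
            attention_mask = attention_mask * self.gate(hidden_states)
        return self.layer(hidden_states, attention_mask)
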
Training Arguments
TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
bf16=False,
bf16_full_eval=False,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_steps=50,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_strategy=HubStrategy.EVERY_SAVE,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=40,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=/mnt/data/device-aware-bert/token_pruning/experiments/RTE/reproduce1/s0.59_lr5e-05_reglr0.01_alpha0.01_warmup50_bin100/runs/Jul19_14-48-49_node-0,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=25,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=80.0,
optim=OptimizerNames.ADAMW_HF,
output_dir=/mnt/data/device-aware-bert/token_pruning/experiments/RTE/reproduce1/s0.59_lr5e-05_reglr0.01_alpha0.01_warmup50_bin100,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=32,
per_device_train_batch_size=32,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
remove_unused_columns=True,
report_to=['mlflow'],
resume_from_checkpoint=None,
run_name=/mnt/data/device-aware-bert/token_pruning/experiments/RTE/reproduce1/s0.59_lr5e-05_reglr0.01_alpha0.01_warmup50_bin100,
save_on_each_node=False,
save_steps=0,
save_strategy=IntervalStrategy.STEPS,
save_total_limit=None,
seed=57,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
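
(For reference, the non-default values in this dump map directly onto the transformers API; a minimal sketch that would reconstruct the same configuration, listing only the settings that differ from library defaults:)

from transformers import TrainingArguments

# Values copied from the dump above.
OUTPUT_DIR = "/mnt/data/device-aware-bert/token_pruning/experiments/RTE/reproduce1/s0.59_lr5e-05_reglr0.01_alpha0.01_warmup50_bin100"

training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",
    eval_steps=50,
    logging_steps=25,
    save_steps=0,
    learning_rate=5e-5,
    num_train_epochs=80.0,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=57,
    log_level="error",          # log_level=40 in the dump
    report_to=["mlflow"],
)
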
Additional Arguments
AdditionalArguments(
test=False,
ex_name='s0.59_lr5e-05_reglr0.01_alpha0.01_warmup50_bin100',
pruning_type='token+pruner',
reg_learning_rate=0.01,
scheduler_type='linear',
freeze_embeddings=True,
pretrained_pruned_model=None,
droprate_init=0.01,
temperature=0.6666666666666666,
prepruning_finetune_epochs=1,
lagrangian_warmup_epochs=50,
target_sparsity=0.59,
sparsity_epsilon=0,
distillation_path='/mnt/data/device-aware-bert/token_pruning/teachers/RTE',
do_distill=True,
do_layer_distill=False,
layer_distill_version=4,
distill_loss_alpha=0.9,
distill_ce_loss_alpha=0.01,
distill_temp=2.0,
use_mac_l0=True,
prune_location=[3, 4, 5, 6, 7, 8, 9, 10, 11],
bin_num=100,
topk=20,
)
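
(Unlike TrainingArguments, AdditionalArguments is repo-specific. A hypothetical sketch of how such a class is typically declared so that transformers' HfArgumentParser can parse it alongside TrainingArguments; field names come from the dump above, but the class body is assumed, not the repo's actual file:)

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AdditionalArguments:
    # Names mirror the dump above; defaults here are illustrative only.
    ex_name: str = "debug"
    pruning_type: str = "token+pruner"
    reg_learning_rate: float = 0.01
    droprate_init: float = 0.01
    temperature: float = 2.0 / 3.0
    lagrangian_warmup_epochs: int = 50
    target_sparsity: float = 0.59
    distillation_path: Optional[str] = None
    do_distill: bool = True
    distill_ce_loss_alpha: float = 0.01
    distill_temp: float = 2.0
    use_mac_l0: bool = True
    prune_location: List[int] = field(
        default_factory=lambda: [3, 4, 5, 6, 7, 8, 9, 10, 11])
    bin_num: int = 100
    topk: int = 20

# HfArgumentParser((TrainingArguments, AdditionalArguments))
#     .parse_args_into_dataclasses() would then yield both objects.
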
----------------------------------------------------------------------
time: 2023-07-19 14:49:32
Evaluating: accuracy: 0.6823, eval_loss: 1.9479, step: 0
lambda_1: 0.0000, lambda_2: 0.0000 lambda_3: 0.0000
Starting l0 regularization! using <class 'models.l0_module.L0ModuleForMAC'>, temperature: 0.67, init drop rate: 0.01, token_loga shape: [9, 100], prune location: [3, 4, 5, 6, 7, 8, 9, 10, 11]
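
(The temperature 0.67 = 2/3 and the [9, 100] token_loga shape, one log-alpha per token bin per pruned layer, are the standard ingredients of a hard-concrete L0 gate in the style of Louizos et al. A sketch under that assumption; the repo's L0ModuleForMAC may differ in detail:)

import math
import torch

LIMIT_A, LIMIT_B, EPS = -0.1, 1.1, 1e-6   # standard hard-concrete stretch interval

def sample_gates(loga: torch.Tensor, temperature: float = 2.0 / 3.0) -> torch.Tensor:
    """Training-time stochastic gates in [0, 1]; loga is the [9, 100] token_loga.
    droprate_init=0.01 typically initializes loga near log(0.99/0.01) ~ 4.6,
    so gates start almost fully open (train remain ~ 1 below)."""
    u = torch.zeros_like(loga).uniform_(EPS, 1.0 - EPS)
    s = torch.sigmoid((torch.log(u) - torch.log(1.0 - u) + loga) / temperature)
    return (s * (LIMIT_B - LIMIT_A) + LIMIT_A).clamp(0.0, 1.0)

def prob_gate_open(loga: torch.Tensor, temperature: float = 2.0 / 3.0) -> torch.Tensor:
    """P(gate > 0); averaging this per row gives the 'train remain' numbers below."""
    return torch.sigmoid(loga - temperature * math.log(-LIMIT_A / LIMIT_B))
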
NDCG TOPK= 20
loss: 0.029388, lagrangian_loss: 0.000382, attention_score_distillation_loss: 0.098598
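
(Each training log line decomposes the objective into a task/distillation loss, a Lagrangian sparsity penalty, and an attention-score distillation term. The penalty form assumed here is the one standard in L0 pruning, and it approximately reproduces the logged values, e.g. about 0.003 at step 50 with the lambdas printed there:)

def lagrangian_regularization(expected_sparsity: float, target_sparsity: float,
                              lambda_1: float, lambda_2: float) -> float:
    """Constrained-optimization penalty (assumed form): zero once the target is
    met; lambda_1 and lambda_2 are themselves trained by gradient ascent, which
    is why they drift steadily in the log lines below."""
    gap = expected_sparsity - target_sparsity
    return lambda_1 * gap + lambda_2 * gap ** 2

# Sanity check against the step-50 entry below:
# -0.4534 * (0.0 - 0.0073) + 0.5518 * (0.0 - 0.0073) ** 2 ~ 0.0033,
# the same order as the lagrangian_loss values logged around that step.
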
----------------------------------------------------------------------
time: 2023-07-19 14:49:57
Evaluating: accuracy: 0.6715, eval_loss: 2.0791, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.0073, step: 50
lambda_1: -0.4534, lambda_2: 0.5518 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
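
(Each of the nine 100-character rows above is the keep mask for one pruned layer's 100 token bins, bin_num=100, with the top row for layer 3 and the bottom row for layer 11. The 12-entry layerwise remain is the running product of per-layer keep ratios across the whole encoder. A sketch of that bookkeeping; the function name is illustrative:)

import numpy as np

def layerwise_remain(infer_remain, prune_loc, num_layers: int = 12):
    """Cumulative fraction of the input sequence surviving after each layer.
    Layers not in prune_loc keep every token that reached them."""
    per_layer = np.ones(num_layers)
    for loc, ratio in zip(prune_loc, infer_remain):
        per_layer[loc] = ratio
    return np.cumprod(per_layer)

# e.g. with infer remain [1.0]*7 + [0.9, 0.78] at layers 3..11, the last two
# entries become 0.9 and 0.9 * 0.78 = 0.70, exactly as logged at step 750 below.
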
loss: 0.058469, lagrangian_loss: 0.003047, attention_score_distillation_loss: 0.098007
loss: 0.069943, lagrangian_loss: 0.008408, attention_score_distillation_loss: 0.097426
ETA: 0:50:53 | Epoch 0 finished. Took 38.65 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:50:22
Evaluating: accuracy: 0.6426, eval_loss: 2.4718, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.0148, step: 100
lambda_1: -1.2022, lambda_2: 1.4034 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
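
(target_sparsity advances by roughly 0.0075 every 50 steps, 0.0073 at step 50 and 0.0148 at step 100 above: a linear ramp over the 50 lagrangian warmup epochs, i.e. about 3,900 steps at roughly 78 steps per epoch for 2,490 examples with batch size 32. A sketch of that schedule, with warmup_steps inferred from the logged values and therefore approximate:)

def scheduled_target_sparsity(step: int, final_sparsity: float = 0.59,
                              warmup_steps: int = 3900) -> float:
    """Linear warmup of the sparsity target; after warmup it stays at 0.59."""
    return final_sparsity * min(1.0, step / warmup_steps)

# scheduled_target_sparsity(1000) ~ 0.151, matching the step-1000 entry below.
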
loss: 0.008282, lagrangian_loss: 0.016422, attention_score_distillation_loss: 0.096749
loss: 1.125452, lagrangian_loss: 0.026686, attention_score_distillation_loss: 0.096124
----------------------------------------------------------------------
time: 2023-07-19 14:50:48
Evaluating: accuracy: 0.6679, eval_loss: 2.1584, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.0224, step: 150
lambda_1: -1.9793, lambda_2: 2.3241 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
loss: 0.462131, lagrangian_loss: 0.038737, attention_score_distillation_loss: 0.095322
ETA: 0:51:20 | Epoch 1 finished. Took 40.33 seconds.
loss: 0.007580, lagrangian_loss: 0.051961, attention_score_distillation_loss: 0.094810
----------------------------------------------------------------------
time: 2023-07-19 14:51:13
Evaluating: accuracy: 0.6787, eval_loss: 2.0548, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.03, step: 200
lambda_1: -2.7483, lambda_2: 3.2444 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
loss: 0.542208, lagrangian_loss: 0.068293, attention_score_distillation_loss: 0.093928
loss: 0.710257, lagrangian_loss: 0.085466, attention_score_distillation_loss: 0.093510
ETA: 0:50:30 | Epoch 2 finished. Took 39.1 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:51:39
Evaluating: accuracy: 0.6643, eval_loss: 1.8392, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.0375, step: 250
lambda_1: -3.5145, lambda_2: 4.1713 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
loss: 0.132700, lagrangian_loss: 0.102254, attention_score_distillation_loss: 0.092884
loss: 0.447086, lagrangian_loss: 0.122965, attention_score_distillation_loss: 0.092084
----------------------------------------------------------------------
time: 2023-07-19 14:52:05
Evaluating: accuracy: 0.6968, eval_loss: 1.9262, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.0451, step: 300
lambda_1: -4.2695, lambda_2: 5.0840 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 1.]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
loss: 0.139396, lagrangian_loss: 0.146520, attention_score_distillation_loss: 0.091695
ETA: 0:50:14 | Epoch 3 finished. Took 40.56 seconds.
loss: 0.013123, lagrangian_loss: 0.171388, attention_score_distillation_loss: 0.090984
----------------------------------------------------------------------
time: 2023-07-19 14:52:30
Evaluating: accuracy: 0.6498, eval_loss: 2.3036, token_prune_loc: [False, False, False, False, False, False, False, False, False], macs_sparsity: 0.0, expected_sparsity: 0.0, expected_sequence_sparsity: 0.7429, target_sparsity: 0.0526, step: 350
lambda_1: -5.0326, lambda_2: 6.0202 lambda_3: 0.0000
train remain: [1. 1. 1. 1. 1. 1. 1. 1. 0.98]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
loss: 0.224630, lagrangian_loss: 0.195066, attention_score_distillation_loss: 0.090328
loss: 0.100729, lagrangian_loss: 0.217814, attention_score_distillation_loss: 0.089662
ETA: 0:49:16 | Epoch 4 finished. Took 38.46 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:52:56
Evaluating: accuracy: 0.6787, eval_loss: 1.8973, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.013, expected_sparsity: 0.0109, expected_sequence_sparsity: 0.7457, target_sparsity: 0.0602, step: 400
lambda_1: -5.7785, lambda_2: 6.9207 lambda_3: 0.0000
train remain: [1. 1. 1. 0.99 1. 1. 1. 1. 0.95]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.92]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1101111111111111111111111111110111111111111111111111111111111111111101011111111111111111111100111010
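
(From step 400 on, infer remain diverges from train remain: training reports the expected openness of stochastic gates, while evaluation binarizes them, which is why infer remain snaps to multiples of 0.01, e.g. 0.92 = 92 of 100 bins kept, matching the zeros in layer 11's row above. A sketch of the eval-time decision under the same hard-concrete assumption as earlier; the 0.5 threshold is illustrative:)

import torch

LIMIT_A, LIMIT_B = -0.1, 1.1

def deterministic_keep_mask(loga: torch.Tensor) -> torch.Tensor:
    """Eval-time gate: use the distribution's mean instead of a sample, then
    hard-threshold. Yields the 0/1 rows printed per pruned layer."""
    z = torch.sigmoid(loga) * (LIMIT_B - LIMIT_A) + LIMIT_A
    return (z.clamp(0.0, 1.0) > 0.5).float()   # shape [9, 100], one row per layer

# infer remain per layer is then simply deterministic_keep_mask(loga).mean(dim=-1).
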
loss: 0.006938, lagrangian_loss: 0.234816, attention_score_distillation_loss: 0.089078
loss: 0.279214, lagrangian_loss: 0.255123, attention_score_distillation_loss: 0.088445
----------------------------------------------------------------------
time: 2023-07-19 14:53:21
Evaluating: accuracy: 0.6823, eval_loss: 2.0859, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0152, expected_sparsity: 0.015, expected_sequence_sparsity: 0.7467, target_sparsity: 0.0678, step: 450
lambda_1: -6.4900, lambda_2: 7.7537 lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 0.99 1. 0.99 0.92]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.89]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.89]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1101111111111111111111111111110111111111111111111101111111111111011101011111011111111111111100111010
loss: 0.003668, lagrangian_loss: 0.274670, attention_score_distillation_loss: 0.087722
ETA: 0:48:51 | Epoch 5 finished. Took 40.57 seconds.
loss: 0.004139, lagrangian_loss: 0.295678, attention_score_distillation_loss: 0.087193
----------------------------------------------------------------------
time: 2023-07-19 14:53:47
Evaluating: accuracy: 0.7076, eval_loss: 2.0329, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0196, expected_sparsity: 0.0177, expected_sequence_sparsity: 0.7474, target_sparsity: 0.0753, step: 500
lambda_1: -7.1801, lambda_2: 8.5509 lambda_3: 0.0000
train remain: [0.99 1. 1. 1. 1. 0.99 0.99 0.99 0.89]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.87]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.87]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1100111111111111111111111111110111111111111111111101111111111111011101011111111111111101110100111010
loss: 0.009879, lagrangian_loss: 0.315272, attention_score_distillation_loss: 0.085547
loss: 0.503761, lagrangian_loss: 0.329236, attention_score_distillation_loss: 0.085897
ETA: 0:48:04 | Epoch 6 finished. Took 38.94 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:54:12
Evaluating: accuracy: 0.704, eval_loss: 1.9875, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0217, expected_sparsity: 0.0205, expected_sequence_sparsity: 0.7481, target_sparsity: 0.0829, step: 550
lambda_1: -7.8482, lambda_2: 9.3105 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.99 1. 0.99 0.99 0.99 0.86]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1100111111111111111111111111110111111111111111111101111111111111011101011011011111111101110100111010
loss: 0.144107, lagrangian_loss: 0.354074, attention_score_distillation_loss: 0.085122
loss: 0.331107, lagrangian_loss: 0.369864, attention_score_distillation_loss: 0.084503
----------------------------------------------------------------------
time: 2023-07-19 14:54:37
Evaluating: accuracy: 0.7076, eval_loss: 1.8793, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0239, expected_sparsity: 0.0218, expected_sequence_sparsity: 0.7485, target_sparsity: 0.0905, step: 600
lambda_1: -8.5034, lambda_2: 10.0515 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.99 1. 0.98 0.99 0.98 0.85]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1100111111111111111111111111110111111111111111111101111111111111011101011011011111111101110100101010
loss: 0.194772, lagrangian_loss: 0.387276, attention_score_distillation_loss: 0.083909
ETA: 0:47:30 | Epoch 7 finished. Took 40.15 seconds.
loss: 0.777043, lagrangian_loss: 0.396693, attention_score_distillation_loss: 0.083216
----------------------------------------------------------------------
time: 2023-07-19 14:55:02
Evaluating: accuracy: 0.7076, eval_loss: 2.1299, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0261, expected_sparsity: 0.0245, expected_sequence_sparsity: 0.7492, target_sparsity: 0.098, step: 650
lambda_1: -9.1318, lambda_2: 10.7424 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.99 1. 0.97 0.99 0.98 0.83]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1100111111111111111111111111110111111111111111111100111111101111011101011011011111111101110100101010
loss: 1.046288, lagrangian_loss: 0.379496, attention_score_distillation_loss: 0.082634
loss: 0.018136, lagrangian_loss: 0.376472, attention_score_distillation_loss: 0.082042
----------------------------------------------------------------------
time: 2023-07-19 14:55:28
Evaluating: accuracy: 0.6787, eval_loss: 2.1448, token_prune_loc: [False, False, False, False, False, False, False, False, True], macs_sparsity: 0.0282, expected_sparsity: 0.0273, expected_sequence_sparsity: 0.7499, target_sparsity: 0.1056, step: 700
lambda_1: -9.6937, lambda_2: 11.3030 lambda_3: 0.0000
train remain: [0.99 1. 1. 0.99 1. 0.97 0.98 0.96 0.81]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1000111111111111111111111111110111111111111111111100111111101111011101011011011111111101110100001010
loss: 0.013614, lagrangian_loss: 0.369929, attention_score_distillation_loss: 0.081239
ETA: 0:46:54 | Epoch 8 finished. Took 40.03 seconds.
loss: 0.009624, lagrangian_loss: 0.335372, attention_score_distillation_loss: 0.080678
----------------------------------------------------------------------
time: 2023-07-19 14:55:53
Evaluating: accuracy: 0.6823, eval_loss: 2.034, token_prune_loc: [False, False, False, False, False, False, False, True, True], macs_sparsity: 0.0505, expected_sparsity: 0.0489, expected_sequence_sparsity: 0.7555, target_sparsity: 0.1132, step: 750
lambda_1: -10.1785, lambda_2: 11.7273 lambda_3: 0.0000
train remain: [0.98 1. 1. 0.99 1. 0.95 0.98 0.94 0.79]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.78]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.9, 0.7]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111101111111011111111101111011111111111111111111101011111111111111110110111111111111111110111110
1000111111111111111111111111110111111111111111111100111101101111011101011011011111111101110000001010
loss: 0.338716, lagrangian_loss: 0.296148, attention_score_distillation_loss: 0.079852
loss: 0.412922, lagrangian_loss: 0.258776, attention_score_distillation_loss: 0.079081
ETA: 0:46:07 | Epoch 9 finished. Took 38.59 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:56:19
Evaluating: accuracy: 0.6895, eval_loss: 2.0758, token_prune_loc: [False, False, False, False, False, True, False, True, True], macs_sparsity: 0.0959, expected_sparsity: 0.0929, expected_sequence_sparsity: 0.7669, target_sparsity: 0.1207, step: 800
lambda_1: -10.5439, lambda_2: 11.9697 lambda_3: 0.0000
train remain: [0.98 1. 0.99 0.99 1. 0.92 0.97 0.93 0.77]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 1.0, 0.89, 0.76]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.78, 0.6]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1011111111111101111111111111111111111111111011111011111111111111101111111111011111110011111011101100
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111101111111011111111101111011111111111111111111101011111111111111110110111111111111101110111110
0000111111111111111111111111110111111111111111111100111101101111011101011011011111111101010000001010
loss: 0.395373, lagrangian_loss: 0.190279, attention_score_distillation_loss: 0.078797
loss: 0.791593, lagrangian_loss: 0.186629, attention_score_distillation_loss: 0.077928
----------------------------------------------------------------------
time: 2023-07-19 14:56:43
Evaluating: accuracy: 0.6715, eval_loss: 2.2295, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1238, expected_sparsity: 0.1211, expected_sequence_sparsity: 0.7742, target_sparsity: 0.1283, step: 850
lambda_1: -10.8044, lambda_2: 12.0899 lambda_3: 0.0000
train remain: [0.98 0.99 0.99 0.99 1. 0.91 0.95 0.93 0.75]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.87, 0.9, 0.88, 0.74]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.87, 0.78, 0.69, 0.51]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1011111111111101111111111111111111111111111011111011111111111111101101111111011111110011111011101100
1111111111111011111111111111111111111110111111111111111101100111111111110111111111101111111111011100
1111111101111111011111111101111011111111111111111111101011111111111101110110111111111111101110111110
0000111111111111111111111111110111111111111111111100111101101111011101011011011111110100010000001010
loss: 0.006347, lagrangian_loss: 0.152384, attention_score_distillation_loss: 0.077399
ETA: 0:45:31 | Epoch 10 finished. Took 40.05 seconds.
loss: 0.005691, lagrangian_loss: 0.130721, attention_score_distillation_loss: 0.076740
----------------------------------------------------------------------
time: 2023-07-19 14:57:09
Evaluating: accuracy: 0.6751, eval_loss: 2.1946, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1307, expected_sparsity: 0.1285, expected_sequence_sparsity: 0.7761, target_sparsity: 0.1359, step: 900
lambda_1: -10.9998, lambda_2: 12.1561 lambda_3: 0.0000
train remain: [0.98 0.99 0.99 0.99 1. 0.89 0.94 0.91 0.73]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.89, 0.87, 0.73]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.86, 0.77, 0.67, 0.49]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111111111111111111111111111011111011111111111111101101111111011111110011111011101100
1111111111111011111111111111111111111110111111111111110101100111111111110111111111101111111111011100
1111111101111111011111111101111011111111111111111111101011111111111101110110111111111111101110111010
0000111111111111111111111111110111111111111111111100111101101111010101011011011111110100010000001010
loss: 0.767037, lagrangian_loss: 0.098042, attention_score_distillation_loss: 0.076120
loss: 0.021643, lagrangian_loss: 0.079077, attention_score_distillation_loss: 0.075473
ETA: 0:44:47 | Epoch 11 finished. Took 38.89 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:57:35
Evaluating: accuracy: 0.7004, eval_loss: 1.9245, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1403, expected_sparsity: 0.1357, expected_sequence_sparsity: 0.778, target_sparsity: 0.1434, step: 950
lambda_1: -11.1225, lambda_2: 12.1820 lambda_3: 0.0000
train remain: [0.98 0.99 0.99 0.99 1. 0.88 0.92 0.9 0.72]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.88, 0.86, 0.72]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.75, 0.64, 0.46]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111111111111111111111111111011111011111111111111101101111111011111110011011011101100
1111111111111011111111111111111111111110111111111111110101100111111111110111111111101101111111011100
1111111101111111011111111111111011111111111111111111101011111111111001110110111111111111101110111000
0000111111111111111111111111110111111111111111111100111101101111010101011011010111110100010000001010
loss: 0.012502, lagrangian_loss: 0.050715, attention_score_distillation_loss: 0.074825
loss: 0.007984, lagrangian_loss: 0.016869, attention_score_distillation_loss: 0.074227
----------------------------------------------------------------------
time: 2023-07-19 14:58:00
Evaluating: accuracy: 0.6534, eval_loss: 2.4658, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1477, expected_sparsity: 0.1427, expected_sequence_sparsity: 0.7798, target_sparsity: 0.151, step: 1000
lambda_1: -11.1668, lambda_2: 12.1867 lambda_3: 0.0000
train remain: [0.98 0.99 0.99 0.99 1. 0.87 0.91 0.88 0.71]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.87, 0.85, 0.71]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.73, 0.62, 0.44]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111111111111111111111111111011111011111111111111101101111101011111110011011011101100
1111111111111011111111111111111111111110111111111111110101100111111111110111111111101101110111011100
1111111101111111011111111111111011101111111111111111101011111111111001110110111111111111101110111000
0000111111111111111111111111110111111111111111111100111101101111010101011011010111110100010000000010
loss: 0.275271, lagrangian_loss: 0.002323, attention_score_distillation_loss: 0.073470
ETA: 0:44:13 | Epoch 12 finished. Took 40.56 seconds.
loss: 0.464858, lagrangian_loss: -0.017623, attention_score_distillation_loss: 0.072871
----------------------------------------------------------------------
time: 2023-07-19 14:58:26
Evaluating: accuracy: 0.6679, eval_loss: 2.22, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1511, expected_sparsity: 0.1468, expected_sequence_sparsity: 0.7808, target_sparsity: 0.1585, step: 1050
lambda_1: -11.1517, lambda_2: 12.1875 lambda_3: 0.0000
train remain: [0.98 0.99 0.99 0.99 1. 0.86 0.89 0.88 0.7 ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.86, 0.84, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.72, 0.61, 0.42]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111111111111111111111111111011111011111111111111101100111111011111110011011011101100
1011111111111011111111111111111111111110111111111111110101100111111111110111111111101101110111011100
1111111101111111011111111101111011101111111111111111101011111111111001110110111111111111101110111000
0000111111111111111111111111110111101111111111111100111101101111010101011011010111110100010000000010
loss: 0.279965, lagrangian_loss: -0.024106, attention_score_distillation_loss: 0.072124
loss: 0.203552, lagrangian_loss: -0.046333, attention_score_distillation_loss: 0.071598
ETA: 0:43:30 | Epoch 13 finished. Took 38.9 seconds.
----------------------------------------------------------------------
time: 2023-07-19 14:58:51
Evaluating: accuracy: 0.6606, eval_loss: 2.2472, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1559, expected_sparsity: 0.1535, expected_sequence_sparsity: 0.7826, target_sparsity: 0.1661, step: 1100
lambda_1: -11.0900, lambda_2: 12.1935 lambda_3: 0.0000
train remain: [0.97 0.99 0.99 0.98 1. 0.85 0.88 0.86 0.7 ]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.85, 0.83, 0.69]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.71, 0.59, 0.4]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111111111111111111111111111011111011111111111111101100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110101100111111111110111111111101001110111011100
1111111101111111011111111101111011101111111111111111101011111111111001110110111111111111101110101000
0000111111111111111111111111110111101111111111111000111101101111010101011011010111110100010000000010
loss: 0.163342, lagrangian_loss: -0.059517, attention_score_distillation_loss: 0.071035
loss: 0.452755, lagrangian_loss: -0.076871, attention_score_distillation_loss: 0.070374
----------------------------------------------------------------------
time: 2023-07-19 14:59:17
Evaluating: accuracy: 0.6462, eval_loss: 2.4025, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1572, expected_sparsity: 0.1554, expected_sequence_sparsity: 0.7831, target_sparsity: 0.1737, step: 1150
lambda_1: -10.9755, lambda_2: 12.2126 lambda_3: 0.0000
train remain: [0.97 0.99 0.99 0.98 1. 0.84 0.87 0.85 0.69]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.84, 0.83, 0.69]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.7, 0.58, 0.4]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111111111111111111111111111011111011111111111111101100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110101100111111111110111011111101001110111011100
1111111101111111011111111101111011101111111111111111101011111111111001110110111111111111101110101000
0000111111111111111111111111110111101111111111111000111101101111010101011011010111110100010000000010
loss: 0.230774, lagrangian_loss: -0.083824, attention_score_distillation_loss: 0.069834
ETA: 0:42:57 | Epoch 14 finished. Took 40.97 seconds.
loss: 0.463707, lagrangian_loss: -0.088518, attention_score_distillation_loss: 0.069067
----------------------------------------------------------------------
time: 2023-07-19 14:59:43
Evaluating: accuracy: 0.6534, eval_loss: 2.2773, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1667, expected_sparsity: 0.162, expected_sequence_sparsity: 0.7847, target_sparsity: 0.1812, step: 1200
lambda_1: -10.8239, lambda_2: 12.2443 lambda_3: 0.0000
train remain: [0.97 0.99 0.99 0.98 1. 0.84 0.86 0.85 0.68]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.83, 0.82, 0.68]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.68, 0.56, 0.38]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111111111111111111111111101011111011111111111111101100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110101100111111110110111011111101001110111011100
1111111101111111011111111101111011101011111111111111101011111111111001110110111111111111101110101000
0000111111111111111111111111110111101111111111111000111101101111010101011011010111110100010000000000
loss: 0.358757, lagrangian_loss: -0.096495, attention_score_distillation_loss: 0.068505
loss: 0.433719, lagrangian_loss: -0.119376, attention_score_distillation_loss: 0.067795
ETA: 0:42:13 | Epoch 15 finished. Took 38.59 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:00:08
Evaluating: accuracy: 0.6823, eval_loss: 2.2446, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.168, expected_sparsity: 0.1646, expected_sequence_sparsity: 0.7854, target_sparsity: 0.1888, step: 1250
lambda_1: -10.6231, lambda_2: 12.2994 lambda_3: 0.0000
train remain: [0.96 0.99 0.99 0.97 1. 0.83 0.85 0.85 0.68]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 0.82, 0.67]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.67, 0.55, 0.37]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111111111111111111111111101011111011111111111111101100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110101100111111110110111011111101000110111011100
1111111101111111011111111101111011101111111111111111100011111111111001110110111111111111101110101000
0000111111111111111111111111110111101111111111111000111100101111010101011011010111110100010000000000
loss: 0.315280, lagrangian_loss: -0.130742, attention_score_distillation_loss: 0.067091
loss: 0.726314, lagrangian_loss: -0.143399, attention_score_distillation_loss: 0.066571
----------------------------------------------------------------------
time: 2023-07-19 15:00:34
Evaluating: accuracy: 0.6968, eval_loss: 1.8997, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1715, expected_sparsity: 0.1684, expected_sequence_sparsity: 0.7864, target_sparsity: 0.1964, step: 1300
lambda_1: -10.3442, lambda_2: 12.4021 lambda_3: 0.0000
train remain: [0.96 0.99 0.99 0.96 1. 0.83 0.84 0.84 0.67]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.82, 0.81, 0.67]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.66, 0.54, 0.36]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111111111111111111111111101011111011111111111111001100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110101100111111110110111011111101000110111011100
1111111101111111011111111101111011101111111111111111100011111111111001110110111101111111101110101000
0000111111111111111111111111110111101111111111111000111100101111010101011011010111110100010000000000
loss: 0.004618, lagrangian_loss: -0.156081, attention_score_distillation_loss: 0.065823
loss: 0.316054, lagrangian_loss: -0.168408, attention_score_distillation_loss: 0.065266
ETA: 0:41:37 | Epoch 16 finished. Took 40.61 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:00:59
Evaluating: accuracy: 0.657, eval_loss: 2.5078, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1728, expected_sparsity: 0.1702, expected_sequence_sparsity: 0.7869, target_sparsity: 0.2039, step: 1350
lambda_1: -10.0052, lambda_2: 12.5523 lambda_3: 0.0000
train remain: [0.96 0.98 0.98 0.95 1. 0.83 0.83 0.84 0.67]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.81, 0.81, 0.67]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.66, 0.53, 0.36]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111101100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110101100111111110110111011111101000100111011100
1111111101111111011111111101111011101011111111111111100011111111111001110110111111111111101110101000
0000111111111111111111111111110111101111111111111000111100101111010101011011010111110100010000000000
loss: 0.009137, lagrangian_loss: -0.162867, attention_score_distillation_loss: 0.064569
loss: 0.007877, lagrangian_loss: -0.177110, attention_score_distillation_loss: 0.063956
----------------------------------------------------------------------
time: 2023-07-19 15:01:24
Evaluating: accuracy: 0.639, eval_loss: 2.5564, token_prune_loc: [False, False, False, False, False, True, True, True, True], macs_sparsity: 0.1754, expected_sparsity: 0.1727, expected_sequence_sparsity: 0.7875, target_sparsity: 0.2115, step: 1400
lambda_1: -9.6294, lambda_2: 12.7362 lambda_3: 0.0000
train remain: [0.96 0.98 0.98 0.94 1. 0.82 0.82 0.84 0.67]
infer remain: [1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.8, 0.81, 0.66]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.65, 0.52, 0.35]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111111111111111111111111101011111011111111111111001100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110001100111111110110111011111101000100111011100
1111111101111111011111111101111011101011111111111111100011111111111001110110111111111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111110100010000000000
loss: 1.071051, lagrangian_loss: -0.177870, attention_score_distillation_loss: 0.063342
ETA: 0:40:59 | Epoch 17 finished. Took 39.99 seconds.
loss: 0.046190, lagrangian_loss: -0.192143, attention_score_distillation_loss: 0.062673
----------------------------------------------------------------------
time: 2023-07-19 15:01:49
Evaluating: accuracy: 0.6968, eval_loss: 2.1858, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2352, expected_sparsity: 0.2299, expected_sequence_sparsity: 0.8023, target_sparsity: 0.2191, step: 1450
lambda_1: -9.1886, lambda_2: 12.9919 lambda_3: 0.0000
train remain: [0.95 0.98 0.98 0.92 0.99 0.82 0.82 0.83 0.66]
infer remain: [1.0, 1.0, 1.0, 0.85, 1.0, 0.81, 0.8, 0.8, 0.66]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.85, 0.69, 0.55, 0.44, 0.29]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111101101011111111111110111111110111111111011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111111111111111111111111101011111011111111111111001100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110001100111111110110111011111101000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111111111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111110100010000000000
loss: 0.035929, lagrangian_loss: -0.209627, attention_score_distillation_loss: 0.062113
loss: 0.551003, lagrangian_loss: -0.196629, attention_score_distillation_loss: 0.061320
ETA: 0:40:17 | Epoch 18 finished. Took 38.93 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:02:15
Evaluating: accuracy: 0.6606, eval_loss: 2.2157, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2387, expected_sparsity: 0.2335, expected_sequence_sparsity: 0.8033, target_sparsity: 0.2266, step: 1500
lambda_1: -8.6754, lambda_2: 13.3431 lambda_3: 0.0000
train remain: [0.95 0.98 0.98 0.91 0.99 0.82 0.81 0.83 0.66]
infer remain: [1.0, 1.0, 1.0, 0.85, 1.0, 0.8, 0.79, 0.8, 0.66]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.85, 0.85, 0.68, 0.54, 0.43, 0.28]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111101101011111111111110111111110111111111011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110001100111111110110111011111100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111111111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111110100010000000000
loss: 0.022867, lagrangian_loss: -0.213551, attention_score_distillation_loss: 0.060804
loss: 0.218195, lagrangian_loss: -0.210707, attention_score_distillation_loss: 0.060107
----------------------------------------------------------------------
time: 2023-07-19 15:02:41
Evaluating: accuracy: 0.657, eval_loss: 2.3756, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2439, expected_sparsity: 0.2372, expected_sequence_sparsity: 0.8042, target_sparsity: 0.2342, step: 1550
lambda_1: -8.1008, lambda_2: 13.7943, lambda_3: 0.0000
train remain: [0.95 0.97 0.98 0.9 0.99 0.82 0.8 0.83 0.66]
infer remain: [1.0, 1.0, 1.0, 0.84, 1.0, 0.8, 0.79, 0.8, 0.66]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.84, 0.84, 0.67, 0.53, 0.42, 0.28]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111110111111111011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110001100111111110110111011111100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111111111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111110100010000000000
loss: 0.003218, lagrangian_loss: -0.223738, attention_score_distillation_loss: 0.058921
ETA: 0:39:40 | Epoch 19 finished. Took 40.6 seconds.
loss: 0.273641, lagrangian_loss: -0.231929, attention_score_distillation_loss: 0.058897
----------------------------------------------------------------------
time: 2023-07-19 15:03:06
Evaluating: accuracy: 0.6534, eval_loss: 2.3509, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2478, expected_sparsity: 0.2429, expected_sequence_sparsity: 0.8057, target_sparsity: 0.2417, step: 1600
lambda_1: -7.4613, lambda_2: 14.3699, lambda_3: 0.0000
train remain: [0.94 0.97 0.97 0.88 0.99 0.81 0.8 0.82 0.66]
infer remain: [1.0, 1.0, 1.0, 0.83, 1.0, 0.8, 0.78, 0.8, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.83, 0.66, 0.52, 0.41, 0.27]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010111111111011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110001100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111111111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.133612, lagrangian_loss: -0.224616, attention_score_distillation_loss: 0.058146
loss: 0.004877, lagrangian_loss: -0.233436, attention_score_distillation_loss: 0.057620
ETA: 0:38:58 | Epoch 20 finished. Took 38.89 seconds.
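target_sparsity climbs by a constant 0.0075-0.0076 every 50 steps throughout this excerpt (0.2191 at step 1450 up to 0.4914 at step 3250), i.e. a linear ramp of roughly 1.51e-4 per step. A sketch of such a schedule follows; the final target and warmup length are not visible here, so both parameters below are assumptions chosen to match the observed slope:

def target_sparsity(step, final_target=0.68, warmup_steps=4500):
    # Assumed linear warmup; 0.68 / 4500 ~ 1.51e-4 per step. Any pair of
    # parameters with the same ratio fits this excerpt equally well.
    return final_target * min(step, warmup_steps) / warmup_steps

for s in (1450, 1500, 3250):
    print(s, round(target_sparsity(s), 4))
# 1450 0.2191 / 1500 0.2267 / 3250 0.4911 -- close to the logged values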
----------------------------------------------------------------------
time: 2023-07-19 15:03:31
Evaluating: accuracy: 0.657, eval_loss: 2.3672, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2478, expected_sparsity: 0.2438, expected_sequence_sparsity: 0.8059, target_sparsity: 0.2493, step: 1650
lambda_1: -6.7702, lambda_2: 15.0619, lambda_3: 0.0000
train remain: [0.94 0.97 0.97 0.87 0.99 0.81 0.8 0.82 0.66]
infer remain: [1.0, 1.0, 1.0, 0.83, 1.0, 0.8, 0.78, 0.79, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.83, 0.83, 0.66, 0.52, 0.41, 0.27]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111101101011111111111110111111010111111110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110001100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.908298, lagrangian_loss: -0.215904, attention_score_distillation_loss: 0.056991
loss: 0.573679, lagrangian_loss: -0.190294, attention_score_distillation_loss: 0.056323
----------------------------------------------------------------------
time: 2023-07-19 15:03:56
Evaluating: accuracy: 0.6679, eval_loss: 2.2519, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2526, expected_sparsity: 0.2475, expected_sequence_sparsity: 0.8069, target_sparsity: 0.2569, step: 1700
lambda_1: -6.1248, lambda_2: 15.6818, lambda_3: 0.0000
train remain: [0.94 0.97 0.97 0.87 0.99 0.81 0.8 0.82 0.65]
infer remain: [1.0, 1.0, 1.0, 0.82, 1.0, 0.8, 0.78, 0.79, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 0.66, 0.51, 0.4, 0.26]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111101101011111111111110111111010101111110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100111101011111110011011011101100
1011111111111011111111111111111111111110111111111111110001100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.405243, lagrangian_loss: -0.167420, attention_score_distillation_loss: 0.055674
ETA: 0:38:20 | Epoch 21 finished. Took 40.24 seconds.
loss: 0.009080, lagrangian_loss: -0.138851, attention_score_distillation_loss: 0.055099
----------------------------------------------------------------------
time: 2023-07-19 15:04:22
Evaluating: accuracy: 0.6968, eval_loss: 2.0523, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2526, expected_sparsity: 0.2475, expected_sequence_sparsity: 0.8069, target_sparsity: 0.2644, step: 1750
lambda_1: -5.5905, lambda_2: 16.1181, lambda_3: 0.0000
train remain: [0.94 0.97 0.97 0.86 0.99 0.81 0.8 0.81 0.65]
infer remain: [1.0, 1.0, 1.0, 0.82, 1.0, 0.8, 0.78, 0.79, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 0.66, 0.51, 0.4, 0.26]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111101101011111111111110111111010101111110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100111101011111110011011011101100
1011111111111011111111111111011111111110111111111111110101100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.010331, lagrangian_loss: -0.117505, attention_score_distillation_loss: 0.054452
loss: 0.416026, lagrangian_loss: -0.111376, attention_score_distillation_loss: 0.053778
ETA: 0:37:38 | Epoch 22 finished. Took 38.68 seconds.
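attention_score_distillation_loss decays smoothly from about 0.063 at the top of this excerpt to about 0.017 by its end, suggesting the student's attention scores are being aligned with a teacher's to guide which tokens to keep. The exact formulation is not visible in the log; the following is a purely hypothetical sketch of such a term (the shapes and masking convention are assumptions):

import torch

def attention_score_distillation(student_attn, teacher_attn, token_mask):
    # Hypothetical: match student attention to the teacher's on surviving
    # tokens only. student_attn, teacher_attn: (B, H, S, S); token_mask:
    # (B, S) float with 1.0 where the token is kept at this layer.
    mask = token_mask[:, None, None, :]          # broadcast over heads, queries
    diff = (student_attn - teacher_attn) * mask  # ignore pruned key positions
    return diff.pow(2).sum() / mask.sum().clamp(min=1)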
----------------------------------------------------------------------
time: 2023-07-19 15:04:48
Evaluating: accuracy: 0.6643, eval_loss: 2.3321, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2539, expected_sparsity: 0.2495, expected_sequence_sparsity: 0.8074, target_sparsity: 0.272, step: 1800
lambda_1: -5.1439, lambda_2: 16.4235, lambda_3: 0.0000
train remain: [0.94 0.97 0.96 0.85 0.99 0.81 0.79 0.81 0.65]
infer remain: [1.0, 1.0, 1.0, 0.82, 1.0, 0.79, 0.78, 0.79, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.82, 0.82, 0.65, 0.51, 0.4, 0.26]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111101101011111111111110111111010101111110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111111110101100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.003822, lagrangian_loss: -0.098229, attention_score_distillation_loss: 0.053143
loss: 0.405344, lagrangian_loss: -0.080831, attention_score_distillation_loss: 0.052330
----------------------------------------------------------------------
time: 2023-07-19 15:05:13
Evaluating: accuracy: 0.6498, eval_loss: 2.3126, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2565, expected_sparsity: 0.2531, expected_sequence_sparsity: 0.8083, target_sparsity: 0.2796, step: 1850
lambda_1: -4.7614, lambda_2: 16.6499, lambda_3: 0.0000
train remain: [0.95 0.96 0.96 0.85 0.99 0.81 0.79 0.81 0.65]
infer remain: [1.0, 1.0, 1.0, 0.81, 1.0, 0.79, 0.78, 0.79, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.81, 0.64, 0.5, 0.39, 0.26]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010101111110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111111110101100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.005114, lagrangian_loss: -0.065466, attention_score_distillation_loss: 0.051820
ETA: 0:37:00 | Epoch 23 finished. Took 40.23 seconds.
loss: 0.321013, lagrangian_loss: -0.063140, attention_score_distillation_loss: 0.051282
----------------------------------------------------------------------
time: 2023-07-19 15:05:38
Evaluating: accuracy: 0.6787, eval_loss: 2.1906, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2599, expected_sparsity: 0.2545, expected_sequence_sparsity: 0.8087, target_sparsity: 0.2871, step: 1900
lambda_1: -4.4577, lambda_2: 16.7908, lambda_3: 0.0000
train remain: [0.94 0.96 0.96 0.84 0.99 0.81 0.79 0.81 0.65]
infer remain: [1.0, 1.0, 1.0, 0.81, 1.0, 0.79, 0.77, 0.79, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.81, 0.64, 0.49, 0.39, 0.25]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010101111110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111111110001100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.244342, lagrangian_loss: -0.056144, attention_score_distillation_loss: 0.050648
loss: 0.084355, lagrangian_loss: -0.048615, attention_score_distillation_loss: 0.050072
----------------------------------------------------------------------
time: 2023-07-19 15:06:04
Evaluating: accuracy: 0.6787, eval_loss: 2.2249, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2599, expected_sparsity: 0.2545, expected_sequence_sparsity: 0.8087, target_sparsity: 0.2947, step: 1950
lambda_1: -4.2078, lambda_2: 16.8851, lambda_3: 0.0000
train remain: [0.94 0.96 0.96 0.84 0.99 0.81 0.79 0.81 0.65]
infer remain: [1.0, 1.0, 1.0, 0.81, 1.0, 0.79, 0.77, 0.79, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.81, 0.64, 0.49, 0.39, 0.25]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010101111110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111111110001100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
ETA: 0:36:22 | Epoch 24 finished. Took 40.36 seconds.
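Across every evaluation in this excerpt, the token_prune_loc flags are True at exactly the locations whose "infer remain" entry has dropped below 1.0, so they appear to mark where a hard mask is actually applied at inference time. For the step-1950 block above:

infer_remain = [1.0, 1.0, 1.0, 0.81, 1.0, 0.79, 0.77, 0.79, 0.65]  # step 1950
print([r < 1.0 for r in infer_remain])
# [False, False, False, True, False, True, True, True, True] -- the logged flags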
loss: 0.003573, lagrangian_loss: -0.043804, attention_score_distillation_loss: 0.049396
loss: 0.402710, lagrangian_loss: -0.028737, attention_score_distillation_loss: 0.048592
----------------------------------------------------------------------
time: 2023-07-19 15:06:29
Evaluating: accuracy: 0.6462, eval_loss: 2.4393, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2599, expected_sparsity: 0.2545, expected_sequence_sparsity: 0.8087, target_sparsity: 0.3023, step: 2000
lambda_1: -4.0187, lambda_2: 16.9394, lambda_3: 0.0000
train remain: [0.94 0.96 0.95 0.84 0.99 0.81 0.79 0.81 0.65]
infer remain: [1.0, 1.0, 1.0, 0.81, 1.0, 0.79, 0.77, 0.79, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.81, 0.81, 0.64, 0.49, 0.39, 0.25]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010101111110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111111110101100011111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.002686, lagrangian_loss: -0.025169, attention_score_distillation_loss: 0.048074
loss: 0.416361, lagrangian_loss: -0.021635, attention_score_distillation_loss: 0.047521
ETA: 0:35:41 | Epoch 25 finished. Took 39.16 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:06:55
Evaluating: accuracy: 0.6787, eval_loss: 2.2932, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2652, expected_sparsity: 0.2581, expected_sequence_sparsity: 0.8096, target_sparsity: 0.3098, step: 2050
lambda_1: -3.9092, lambda_2: 16.9577, lambda_3: 0.0000
train remain: [0.93 0.95 0.95 0.83 0.99 0.81 0.79 0.81 0.65]
infer remain: [1.0, 1.0, 1.0, 0.8, 1.0, 0.79, 0.77, 0.79, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 0.63, 0.49, 0.38, 0.25]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010101101110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111111110001100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.014857, lagrangian_loss: -0.010472, attention_score_distillation_loss: 0.046770
loss: 0.250807, lagrangian_loss: -0.007671, attention_score_distillation_loss: 0.046138
----------------------------------------------------------------------
time: 2023-07-19 15:07:21
Evaluating: accuracy: 0.6282, eval_loss: 2.5998, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2652, expected_sparsity: 0.2581, expected_sequence_sparsity: 0.8096, target_sparsity: 0.3174, step: 2100
lambda_1: -3.8559, lambda_2: 16.9626, lambda_3: 0.0000
train remain: [0.93 0.95 0.95 0.83 0.99 0.81 0.79 0.81 0.65]
infer remain: [1.0, 1.0, 1.0, 0.8, 1.0, 0.79, 0.77, 0.79, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 0.63, 0.49, 0.38, 0.25]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010101101110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111111110001100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000
0000111111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.008185, lagrangian_loss: 0.002283, attention_score_distillation_loss: 0.045498
ETA: 0:35:03 | Epoch 26 finished. Took 40.73 seconds.
loss: 0.004606, lagrangian_loss: 0.005938, attention_score_distillation_loss: 0.044921
----------------------------------------------------------------------
time: 2023-07-19 15:07:47
Evaluating: accuracy: 0.6318, eval_loss: 2.5448, token_prune_loc: [False, False, False, True, False, True, True, True, True], macs_sparsity: 0.2652, expected_sparsity: 0.2586, expected_sequence_sparsity: 0.8097, target_sparsity: 0.325, step: 2150
lambda_1: -3.8766, lambda_2: 16.9644, lambda_3: 0.0000
train remain: [0.93 0.95 0.95 0.83 0.99 0.81 0.78 0.81 0.65]
infer remain: [1.0, 1.0, 1.0, 0.8, 1.0, 0.79, 0.77, 0.79, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.8, 0.8, 0.63, 0.49, 0.38, 0.25]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010101101110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111111110101100011111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000
0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.680501, lagrangian_loss: 0.011555, attention_score_distillation_loss: 0.044267
loss: 0.003173, lagrangian_loss: 0.012474, attention_score_distillation_loss: 0.043623
ETA: 0:34:23 | Epoch 27 finished. Took 39.11 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:08:12
Evaluating: accuracy: 0.639, eval_loss: 2.5054, token_prune_loc: [False, True, False, True, False, True, True, True, True], macs_sparsity: 0.3219, expected_sparsity: 0.3174, expected_sequence_sparsity: 0.825, target_sparsity: 0.3325, step: 2200
lambda_1: -3.9486, lambda_2: 16.9719, lambda_3: 0.0000
train remain: [0.93 0.94 0.95 0.82 0.99 0.8 0.78 0.8 0.64]
infer remain: [1.0, 0.88, 1.0, 0.79, 1.0, 0.79, 0.76, 0.79, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.7, 0.7, 0.55, 0.42, 0.33, 0.21]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1110111011111111111110111111111111111111110111111111011111111011110111111111111111111101111011011100
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010101100110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111111110001100011111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001110110111101111111101110101000
0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.461044, lagrangian_loss: 0.016034, attention_score_distillation_loss: 0.043001
loss: 0.302222, lagrangian_loss: 0.027008, attention_score_distillation_loss: 0.042296
----------------------------------------------------------------------
time: 2023-07-19 15:08:37
Evaluating: accuracy: 0.6282, eval_loss: 2.5437, token_prune_loc: [False, True, False, True, False, True, True, True, True], macs_sparsity: 0.3219, expected_sparsity: 0.3181, expected_sequence_sparsity: 0.8251, target_sparsity: 0.3401, step: 2250
lambda_1: -4.0827, lambda_2: 16.9966, lambda_3: 0.0000
train remain: [0.93 0.94 0.95 0.82 0.99 0.8 0.78 0.8 0.64]
infer remain: [1.0, 0.88, 1.0, 0.79, 1.0, 0.79, 0.76, 0.78, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.7, 0.7, 0.55, 0.42, 0.33, 0.21]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1110111011111111111110111111111111111111110111111111011111111011110111111111111111111101111011011100
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010101100110011110111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111111110001100011111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000
0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.002845, lagrangian_loss: 0.033180, attention_score_distillation_loss: 0.041716
ETA: 0:33:44 | Epoch 28 finished. Took 40.15 seconds.
loss: 0.322900, lagrangian_loss: 0.041308, attention_score_distillation_loss: 0.041073
----------------------------------------------------------------------
time: 2023-07-19 15:09:02
Evaluating: accuracy: 0.6354, eval_loss: 2.4492, token_prune_loc: [False, True, False, True, False, True, True, True, True], macs_sparsity: 0.3258, expected_sparsity: 0.3212, expected_sequence_sparsity: 0.8259, target_sparsity: 0.3476, step: 2300
lambda_1: -4.3109, lambda_2: 17.0635, lambda_3: 0.0000
train remain: [0.93 0.94 0.95 0.81 0.99 0.8 0.78 0.8 0.64]
infer remain: [1.0, 0.88, 1.0, 0.78, 1.0, 0.79, 0.76, 0.78, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.69, 0.69, 0.54, 0.41, 0.32, 0.21]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1110111011111111111110111111111111111111110111111111011111111011110111111111111111111101111011011100
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010101100110011010111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111110110001100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000
0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.006759, lagrangian_loss: 0.055445, attention_score_distillation_loss: 0.040399
loss: 0.370006, lagrangian_loss: 0.059518, attention_score_distillation_loss: 0.039905
ETA: 0:33:03 | Epoch 29 finished. Took 39.34 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:09:28
Evaluating: accuracy: 0.6065, eval_loss: 2.6959, token_prune_loc: [False, True, False, True, False, True, True, True, True], macs_sparsity: 0.3258, expected_sparsity: 0.3212, expected_sequence_sparsity: 0.8259, target_sparsity: 0.3552, step: 2350
lambda_1: -4.6406, lambda_2: 17.2022, lambda_3: 0.0000
train remain: [0.92 0.94 0.94 0.81 0.99 0.8 0.78 0.8 0.64]
infer remain: [1.0, 0.88, 1.0, 0.78, 1.0, 0.79, 0.76, 0.78, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.69, 0.69, 0.54, 0.41, 0.32, 0.21]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1110111011111111111110111111111111111111110111111111011111111011110111111111111111111101111011011100
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010101100110011110110000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111110110001100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000
0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 1.083088, lagrangian_loss: 0.075279, attention_score_distillation_loss: 0.039230
loss: 0.421221, lagrangian_loss: 0.087199, attention_score_distillation_loss: 0.038668
----------------------------------------------------------------------
time: 2023-07-19 15:09:54
Evaluating: accuracy: 0.6065, eval_loss: 2.6834, token_prune_loc: [False, True, False, True, False, True, True, True, True], macs_sparsity: 0.328, expected_sparsity: 0.3229, expected_sequence_sparsity: 0.8264, target_sparsity: 0.3628, step: 2400
lambda_1: -5.0757, lambda_2: 17.4417, lambda_3: 0.0000
train remain: [0.92 0.93 0.94 0.81 0.99 0.8 0.77 0.8 0.64]
infer remain: [1.0, 0.88, 1.0, 0.78, 1.0, 0.78, 0.76, 0.78, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 1.0, 0.88, 0.88, 0.69, 0.69, 0.54, 0.41, 0.32, 0.2]
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1110111011111111111110111111111111111111110111111111011111111011110111111111111111111101111011011100
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101011111111111110111111010101100110011100111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111101011111011111111111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111110110001100111111110110111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000
0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.154039, lagrangian_loss: 0.091723, attention_score_distillation_loss: 0.037923
ETA: 0:32:24 | Epoch 30 finished. Took 39.57 seconds.
loss: 0.508581, lagrangian_loss: 0.101501, attention_score_distillation_loss: 0.037309
----------------------------------------------------------------------
time: 2023-07-19 15:10:19
Evaluating: accuracy: 0.5993, eval_loss: 2.736, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4416, expected_sparsity: 0.4334, expected_sequence_sparsity: 0.8549, target_sparsity: 0.3703, step: 2450
lambda_1: -5.5404, lambda_2: 17.7165, lambda_3: 0.0000
train remain: [0.91 0.93 0.94 0.81 0.99 0.8 0.77 0.8 0.64]
infer remain: [0.84, 0.88, 0.87, 0.78, 1.0, 0.78, 0.75, 0.78, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 0.84, 0.74, 0.64, 0.5, 0.5, 0.39, 0.29, 0.23, 0.15]
1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111111111111110
1110111011111111111110111111111111111111110111111111011111111011110111111111111111111101111011011100
1111111111111111111111111110111111111110111111111111110111101011111111101111111011101111101011100110
1111111111111111001111111111111111011101111101111001101010111111111110111111010101101110011100111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111110110001100111111110100111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000
0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.715809, lagrangian_loss: 0.106383, attention_score_distillation_loss: 0.036697
loss: 0.003022, lagrangian_loss: 0.120150, attention_score_distillation_loss: 0.036098
ETA: 0:31:43 | Epoch 31 finished. Took 39.32 seconds.
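For reference, a generic illustration of what applying one of the printed hard masks between layers amounts to: gather the hidden states whose mask bit is 1 and drop the rest. The function and shapes here are illustrative assumptions, not this repository's API:

import torch

def prune_tokens(hidden, mask):
    # hidden: (S, D) hidden states; mask: (S,) of {0, 1} as in the rows above.
    keep = mask.nonzero(as_tuple=True)[0]
    return hidden[keep], keep  # surviving states plus their original indices

row = "0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000"
mask = torch.tensor([int(c) for c in row])
pruned, kept_idx = prune_tokens(torch.randn(len(row), 768), mask)
print(pruned.shape[0] / len(row))  # fraction kept; tracks this location's "infer remain"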
----------------------------------------------------------------------
time: 2023-07-19 15:10:45
Evaluating: accuracy: 0.6426, eval_loss: 2.3733, token_prune_loc: [True, True, False, True, False, True, True, True, True], macs_sparsity: 0.416, expected_sparsity: 0.41, expected_sequence_sparsity: 0.8489, target_sparsity: 0.3779, step: 2500
lambda_1: -6.0449, lambda_2: 18.0453, lambda_3: 0.0000
train remain: [0.91 0.92 0.94 0.81 0.99 0.79 0.77 0.8 0.64]
infer remain: [0.83, 0.87, 1.0, 0.77, 1.0, 0.78, 0.75, 0.78, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 0.83, 0.72, 0.72, 0.56, 0.56, 0.43, 0.33, 0.25, 0.16]
1111111111111011111111101111101011011011110111100110110110111011111111111101011011111111111111111110
1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011011100
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101010111111111110111111010101101110011000111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111110110001100111111110100111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000
0000111111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.011106, lagrangian_loss: 0.126886, attention_score_distillation_loss: 0.035442
loss: 0.010029, lagrangian_loss: 0.135964, attention_score_distillation_loss: 0.034778
----------------------------------------------------------------------
time: 2023-07-19 15:11:10
Evaluating: accuracy: 0.6173, eval_loss: 2.5042, token_prune_loc: [True, True, False, True, False, True, True, True, True], macs_sparsity: 0.416, expected_sparsity: 0.41, expected_sequence_sparsity: 0.8489, target_sparsity: 0.3855, step: 2550
lambda_1: -6.5762, lambda_2: 18.4166, lambda_3: 0.0000
train remain: [0.9 0.92 0.94 0.8 0.99 0.79 0.76 0.8 0.64]
infer remain: [0.83, 0.87, 1.0, 0.77, 1.0, 0.78, 0.75, 0.78, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 0.83, 0.72, 0.72, 0.56, 0.56, 0.43, 0.33, 0.25, 0.16]
1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111111111111010
1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011011100
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1111111111111111001111111111111111011101111101111001101010111111111110111111010101101110011000111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111110110001100111111110100111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000
0000011111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 1.012969, lagrangian_loss: 0.137797, attention_score_distillation_loss: 0.034087
ETA: 0:31:05 | Epoch 32 finished. Took 40.25 seconds.
loss: 0.011051, lagrangian_loss: 0.130916, attention_score_distillation_loss: 0.033475
----------------------------------------------------------------------
time: 2023-07-19 15:11:36
Evaluating: accuracy: 0.6426, eval_loss: 2.4858, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4528, expected_sparsity: 0.4453, expected_sequence_sparsity: 0.858, target_sparsity: 0.393, step: 2600
lambda_1: -7.0526, lambda_2: 18.7208, lambda_3: 0.0000
train remain: [0.9 0.91 0.93 0.79 0.99 0.79 0.76 0.79 0.64]
infer remain: [0.83, 0.87, 0.87, 0.76, 1.0, 0.78, 0.75, 0.78, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 0.83, 0.72, 0.63, 0.48, 0.48, 0.37, 0.28, 0.22, 0.14]
1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111111111111010
1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011011100
1111111111111111111111111110111111111110111111011111110111101111111111101111111011101111101011100110
1111111111111111001111111111111111011101111101111001101010111111111110111111010101100110011000111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111110110001100111111110100111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110101000
0000011111111111111111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.407927, lagrangian_loss: 0.115509, attention_score_distillation_loss: 0.032848
loss: 0.010131, lagrangian_loss: 0.108897, attention_score_distillation_loss: 0.032199
----------------------------------------------------------------------
time: 2023-07-19 15:12:01
Evaluating: accuracy: 0.6101, eval_loss: 2.6151, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4567, expected_sparsity: 0.4515, expected_sequence_sparsity: 0.8596, target_sparsity: 0.4006, step: 2650
lambda_1: -7.4446, lambda_2: 18.9283, lambda_3: 0.0000
train remain: [0.89 0.91 0.92 0.78 0.99 0.79 0.75 0.79 0.64]
infer remain: [0.83, 0.87, 0.86, 0.75, 1.0, 0.78, 0.74, 0.77, 0.63]
layerwise remain: [1.0, 1.0, 1.0, 0.83, 0.72, 0.62, 0.47, 0.47, 0.36, 0.27, 0.21, 0.13]
1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111111111111010
1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011011100
1110111111111111111111111110111111111110111111011111110111101111111111101111111011101111101011100110
1111111111111111001111111111111111011101111101111001101010111111111110111111010101100110011000110000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101011111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011101111110111111111110110001100111111110100111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101101111101110101000
0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.003942, lagrangian_loss: 0.101805, attention_score_distillation_loss: 0.031580
ETA: 0:30:27 | Epoch 33 finished. Took 40.92 seconds.
loss: 0.329970, lagrangian_loss: 0.106134, attention_score_distillation_loss: 0.030961
----------------------------------------------------------------------
time: 2023-07-19 15:12:27
Evaluating: accuracy: 0.6101, eval_loss: 2.5916, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4645, expected_sparsity: 0.4557, expected_sequence_sparsity: 0.8607, target_sparsity: 0.4082, step: 2700
lambda_1: -7.8081, lambda_2: 19.1057, lambda_3: 0.0000
train remain: [0.89 0.9 0.91 0.77 0.99 0.78 0.75 0.79 0.63]
infer remain: [0.83, 0.86, 0.86, 0.75, 1.0, 0.77, 0.74, 0.77, 0.63]
layerwise remain: [1.0, 1.0, 1.0, 0.83, 0.71, 0.61, 0.46, 0.46, 0.35, 0.26, 0.2, 0.13]
1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111111111111010
1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011001100
1110111111111111111111111110111111111110111111011111110111101111111111101111111011101111101011100110
1111111111111111001111111111111111011101111101111001101010111101111110111111010101100110011000111000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111101011111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111110110001100011111110100111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101101111101110101000
0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.007095, lagrangian_loss: 0.094590, attention_score_distillation_loss: 0.030400
loss: 0.004053, lagrangian_loss: 0.105643, attention_score_distillation_loss: 0.029657
ETA: 0:29:46 | Epoch 34 finished. Took 39.21 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:12:52
Evaluating: accuracy: 0.6426, eval_loss: 2.4546, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4758, expected_sparsity: 0.4642, expected_sequence_sparsity: 0.8629, target_sparsity: 0.4157, step: 2750
lambda_1: -8.1299, lambda_2: 19.2449, lambda_3: 0.0000
train remain: [0.88 0.89 0.91 0.77 0.99 0.78 0.75 0.79 0.63]
infer remain: [0.82, 0.86, 0.85, 0.74, 1.0, 0.77, 0.74, 0.77, 0.63]
layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.71, 0.6, 0.44, 0.44, 0.34, 0.25, 0.19, 0.12]
1111111111111011111111101111101011011011110111100110110110111011111111111101011011111111111111111010
1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011001100
1111111111111111111111111110111111111110111111011011110111101011111111101111111011101111101011100110
1111111111111111001111111111111111011101111101111001101010111101111110111111010101100110011000101000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111101011111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111110110001100011111110100111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110100000
0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.056042, lagrangian_loss: 0.097039, attention_score_distillation_loss: 0.029012
loss: 0.007306, lagrangian_loss: 0.076601, attention_score_distillation_loss: 0.028449
----------------------------------------------------------------------
time: 2023-07-19 15:13:18
Evaluating: accuracy: 0.6426, eval_loss: 2.4185, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4758, expected_sparsity: 0.4649, expected_sequence_sparsity: 0.8631, target_sparsity: 0.4233, step: 2800
lambda_1: -8.3977, lambda_2: 19.3408, lambda_3: 0.0000
train remain: [0.87 0.89 0.9 0.76 0.99 0.78 0.74 0.78 0.63]
infer remain: [0.82, 0.86, 0.85, 0.74, 1.0, 0.77, 0.73, 0.77, 0.63]
layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.71, 0.6, 0.44, 0.44, 0.34, 0.25, 0.19, 0.12]
1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111011111111010
1110111011111111111110101111111111111111110111111111011111111011110111111111111111111101111011001100
1111111111111111111111111110111111111110111111011011110111101011111111101111111011101111101011100110
1111111111111111001101111111111111011101111101111001101010111111111110111111010101100110011000101000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111101011111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111110110001100011111010100111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110100000
0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.310369, lagrangian_loss: 0.078774, attention_score_distillation_loss: 0.027789
ETA: 0:29:07 | Epoch 35 finished. Took 39.96 seconds.
loss: 0.070664, lagrangian_loss: 0.093996, attention_score_distillation_loss: 0.027186
----------------------------------------------------------------------
time: 2023-07-19 15:13:43
Evaluating: accuracy: 0.6173, eval_loss: 2.5996, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4771, expected_sparsity: 0.47, expected_sequence_sparsity: 0.8644, target_sparsity: 0.4309, step: 2850
lambda_1: -8.6763, lambda_2: 19.4434, lambda_3: 0.0000
train remain: [0.87 0.89 0.9 0.75 0.99 0.78 0.74 0.78 0.63]
infer remain: [0.82, 0.85, 0.85, 0.73, 1.0, 0.77, 0.73, 0.77, 0.63]
layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.7, 0.59, 0.43, 0.43, 0.33, 0.24, 0.19, 0.12]
1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111011111111010
1110111011111111111110101111111111111111110111111111011111111011110111111111111101111101111011001100
1111111111111111111111111110111111111110111111011011110111101111111111101011111011101111101011100110
1111111111111111001101111111111111011101111101111001101010111101111110111111010101100110011000101000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111111101111011111111111111111111101001111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011111111110111111111110110001100011111010100111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101111111101110100000
0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.007409, lagrangian_loss: 0.097927, attention_score_distillation_loss: 0.026529
loss: 0.004345, lagrangian_loss: 0.080982, attention_score_distillation_loss: 0.025926
ETA: 0:28:26 | Epoch 36 finished. Took 38.75 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:14:08
Evaluating: accuracy: 0.6209, eval_loss: 2.5957, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4831, expected_sparsity: 0.4754, expected_sequence_sparsity: 0.8658, target_sparsity: 0.4384, step: 2900
lambda_1: -8.9536, lambda_2: 19.5435, lambda_3: 0.0000
train remain: [0.86 0.88 0.9 0.74 0.99 0.78 0.73 0.78 0.63]
infer remain: [0.82, 0.85, 0.84, 0.72, 1.0, 0.77, 0.72, 0.76, 0.63]
layerwise remain: [1.0, 1.0, 1.0, 0.82, 0.7, 0.59, 0.42, 0.42, 0.32, 0.23, 0.18, 0.11]
1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111011111111010
1110111011111111111110101111111111111111110111111111011111111011110111111111111101111101111011001100
1111111111111111111111111110111111111110111111011011110111101011111111111011101011101111101011100110
1111111111111111001101111111111111011101111101111001101010111101111110111111010101100110011000100000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111101011111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011101111110111111111110110001100011111010100111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101101111101110100000
0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.002487, lagrangian_loss: 0.076532, attention_score_distillation_loss: 0.025307
loss: 0.041345, lagrangian_loss: 0.081311, attention_score_distillation_loss: 0.024667
----------------------------------------------------------------------
time: 2023-07-19 15:14:34
Evaluating: accuracy: 0.6173, eval_loss: 2.6695, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.487, expected_sparsity: 0.4802, expected_sequence_sparsity: 0.867, target_sparsity: 0.446, step: 2950
lambda_1: -9.2007, lambda_2: 19.6217, lambda_3: 0.0000
train remain: [0.85 0.88 0.89 0.74 0.99 0.78 0.73 0.78 0.63]
infer remain: [0.81, 0.85, 0.84, 0.72, 1.0, 0.76, 0.72, 0.76, 0.63]
layerwise remain: [1.0, 1.0, 1.0, 0.81, 0.69, 0.58, 0.42, 0.42, 0.32, 0.23, 0.17, 0.11]
1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111011110111010
1110111011111111111110101111111111111111110111111111011111111011110111111111111101111101111011001100
1111111111111111111111111110111111111110111111011011110111101111111111101011101011101111101011100110
1111111111111111001101111111111111011101111101111001101010111101111110111111010101100110011000001000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111101001111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011101111110111111111110110001100011111010100111011011100000100111011100
1111111101111111011111111101111011101011111011111111100011111111111001100110111101101111101110100000
0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.008462, lagrangian_loss: 0.074117, attention_score_distillation_loss: 0.024051
ETA: 0:27:47 | Epoch 37 finished. Took 40.15 seconds.
loss: 0.043701, lagrangian_loss: 0.069809, attention_score_distillation_loss: 0.023345
----------------------------------------------------------------------
time: 2023-07-19 15:14:59
Evaluating: accuracy: 0.6173, eval_loss: 2.699, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4961, expected_sparsity: 0.4872, expected_sequence_sparsity: 0.8689, target_sparsity: 0.4535, step: 3000
lambda_1: -9.4446, lambda_2: 19.6970, lambda_3: 0.0000
train remain: [0.85 0.87 0.88 0.73 0.99 0.77 0.73 0.77 0.63]
infer remain: [0.81, 0.84, 0.83, 0.71, 1.0, 0.76, 0.72, 0.76, 0.63]
layerwise remain: [1.0, 1.0, 1.0, 0.81, 0.68, 0.56, 0.4, 0.4, 0.3, 0.22, 0.17, 0.11]
1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111011110111010
1110111011111111110110101111111111111111110111111111011111111011110111111111111101111101111011001100
1111111111111111111111111110111111111110111111011011110111101011111111101011101011101111101011100110
1111111111111111001101111111111111011101111101111001101010111101111110111111010101100110011000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111101001111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011101111110111111111110110001100011111010100111011011100000100111011100
1111111101011111011111111101111011101011111011111111100011111111111001100110111101111111101110100000
0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.114981, lagrangian_loss: 0.100112, attention_score_distillation_loss: 0.022715
loss: 0.519294, lagrangian_loss: 0.097500, attention_score_distillation_loss: 0.022118
ETA: 0:27:06 | Epoch 38 finished. Took 39.11 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:15:25
Evaluating: accuracy: 0.6245, eval_loss: 2.6877, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.4961, expected_sparsity: 0.49, expected_sequence_sparsity: 0.8696, target_sparsity: 0.4611, step: 3050
lambda_1: -9.7092, lambda_2: 19.7842, lambda_3: 0.0000
train remain: [0.84 0.87 0.88 0.73 0.99 0.77 0.72 0.77 0.63]
infer remain: [0.81, 0.84, 0.82, 0.71, 1.0, 0.76, 0.71, 0.76, 0.63]
layerwise remain: [1.0, 1.0, 1.0, 0.81, 0.68, 0.56, 0.4, 0.4, 0.3, 0.21, 0.16, 0.1]
1111111111111011111111111111101011011011110111100110110110111011111111111101011011111111011110111010
1110111011111111111110101111111111111111110111111111011111111011110111111110111101111101111011001100
1111111111111111111111111110111111111110111111011011110111101011111111101010101011101111101011100110
1111111111111111001101111111111111011101111101111001101010111101111110111111010101100110011000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111101001111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011101111110111111111110110001000011111010100111011011100000100111011100
1111111101011111011111111101111011101011111011111111100011111111111001100110111101111111101110100000
0000011111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.341494, lagrangian_loss: 0.085704, attention_score_distillation_loss: 0.021495
loss: 0.255516, lagrangian_loss: 0.068332, attention_score_distillation_loss: 0.020899
----------------------------------------------------------------------
time: 2023-07-19 15:15:51
Evaluating: accuracy: 0.6029, eval_loss: 2.6045, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5078, expected_sparsity: 0.4994, expected_sequence_sparsity: 0.872, target_sparsity: 0.4687, step: 3100
lambda_1: -9.9610, lambda_2: 19.8629, lambda_3: 0.0000
train remain: [0.83 0.86 0.87 0.72 0.99 0.77 0.72 0.77 0.63]
infer remain: [0.8, 0.83, 0.82, 0.7, 1.0, 0.75, 0.71, 0.76, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.66, 0.54, 0.38, 0.38, 0.29, 0.2, 0.15, 0.1]
1111111111111011111111111111101011011011100111100110110110111011111111111101011011111111011110111010
1110111011111111110110101111111111111111110111111111011111111011110111111110111101111101111011001100
1111111111111111111111111110111111111110111111011011111111101011111111101010101011101011101011100110
1111111111111111001101111111111111011101111101111001101000111101111110111111010101100110011000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111101001111011111011111111001100011101011111110011011011101100
1011111111111011111111111111011101111110111111111110110001000011111010100111011011100000100111011100
1111111101011111011111111101111011101011111011111111100011111111111001100110111101111111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.032344, lagrangian_loss: 0.065753, attention_score_distillation_loss: 0.020272
ETA: 0:26:28 | Epoch 39 finished. Took 40.91 seconds.
loss: 0.015267, lagrangian_loss: 0.076968, attention_score_distillation_loss: 0.019599
----------------------------------------------------------------------
time: 2023-07-19 15:16:16
Evaluating: accuracy: 0.5921, eval_loss: 2.7903, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5112, expected_sparsity: 0.5024, expected_sequence_sparsity: 0.8728, target_sparsity: 0.4762, step: 3150
lambda_1: -10.1977, lambda_2: 19.9312, lambda_3: 0.0000
train remain: [0.83 0.86 0.86 0.72 0.99 0.76 0.71 0.77 0.62]
infer remain: [0.8, 0.83, 0.81, 0.7, 1.0, 0.75, 0.7, 0.75, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.66, 0.54, 0.38, 0.38, 0.28, 0.2, 0.15, 0.09]
1111111111111011111111111111101011011011100111100110110110111011111111111101011011111111011110111010
1110111011111111111110101111111111111111110111111111011111111011110111111110111101111001111011001100
1111111111111111111111111111111111111110111111011011111111101011111111101010101011101000101011100110
1111111111111111001101111111111111011101111101111001101000111101111110111111010101100110011000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111101001111011111011111111001100011101011111110011011011101000
1011111111111011111111111111011101101110111111111110110001000011111010100111011011100000100111011100
1111111101011111011111111101111011101011111011111111100011111111111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.002551, lagrangian_loss: 0.078639, attention_score_distillation_loss: 0.018954
loss: 0.004602, lagrangian_loss: 0.075491, attention_score_distillation_loss: 0.018321
ETA: 0:25:48 | Epoch 40 finished. Took 39.24 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:16:42
Evaluating: accuracy: 0.6282, eval_loss: 2.5548, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5151, expected_sparsity: 0.5062, expected_sequence_sparsity: 0.8738, target_sparsity: 0.4838, step: 3200
lambda_1: -10.4200, lambda_2: 19.9915, lambda_3: 0.0000
train remain: [0.83 0.85 0.84 0.71 0.99 0.76 0.71 0.77 0.62]
infer remain: [0.8, 0.83, 0.8, 0.69, 1.0, 0.75, 0.7, 0.75, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.8, 0.66, 0.53, 0.37, 0.37, 0.27, 0.19, 0.14, 0.09]
1111111111111011111111111111101011011011100111100110110110111011111111111101011011111111011110111010
1110111011111111111110101111111111111111110111111111011111111011110111111110111101111001111011001100
1111111111111111111111111111111111111110111111110011111111101011111111101010101010101000101011100110
1111111111111111001101111111111111011101111101111001101000111101111110111111010101100110010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111101001111011111011111111001100011101011111110011011011101000
1011111111111011111111111111011101101110111111111110110001000011111010100111011011100000100111011100
1111111101011111011111111101111011101011111011111111100011111111111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.003761, lagrangian_loss: 0.053516, attention_score_distillation_loss: 0.017741
loss: 1.196991, lagrangian_loss: 0.075100, attention_score_distillation_loss: 0.017049
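`lagrangian_loss` tracks the sparsity constraint rather than the task objective. In CoFi-style structured pruning this penalty is typically lambda_1 * (s - t) + lambda_2 * (s - t)^2, where s is the expected sparsity, t the annealed target, and the multipliers are trained by gradient ascent against the model. Whether this script uses exactly that form is an assumption; a minimal sketch of the shape:

import torch

# Hedged sketch of a CoFi-style Lagrangian sparsity penalty (assumed form).
lambda_1 = torch.zeros(1, requires_grad=True)
lambda_2 = torch.zeros(1, requires_grad=True)

def lagrangian_penalty(expected_sparsity: torch.Tensor, target_sparsity: float) -> torch.Tensor:
    gap = expected_sparsity - target_sparsity
    # the model minimizes this term; lambda_1 and lambda_2 take ascent steps on it
    return lambda_1 * gap + lambda_2 * gap * gap

penalty = lagrangian_penalty(torch.tensor(0.5062), 0.4838)  # values from the block above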
----------------------------------------------------------------------
time: 2023-07-19 15:17:08
Evaluating: accuracy: 0.6209, eval_loss: 2.5255, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5229, expected_sparsity: 0.5169, expected_sequence_sparsity: 0.8765, target_sparsity: 0.4914, step: 3250
lambda_1: -10.6231, lambda_2: 20.0405, lambda_3: 0.0000
train remain: [0.82 0.84 0.83 0.7 0.99 0.76 0.71 0.76 0.62]
infer remain: [0.79, 0.82, 0.79, 0.68, 1.0, 0.74, 0.7, 0.75, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.79, 0.65, 0.51, 0.35, 0.35, 0.26, 0.18, 0.14, 0.08]
1111111111111011111111111111101011011011100111100110110110111011111111110101011011111111011110111010
1110111011111111110110101111111111111111110111111111011111111011110111111110111101111001111011001100
1111111111111111111111111111111111111110111111010011111111101011111111101010101010101000101011100110
1111111111111111001101111111111111011101111101111001101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111011111011111111001100011101011111110011011011101000
1011111111111011111111111111011101101110111111111110110001000011111010100111011011100000100111011100
1111111101011111011111111101111011101011111011111111100011111111111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.015737, lagrangian_loss: 0.071713, attention_score_distillation_loss: 0.016435
loss: 0.056410, lagrangian_loss: 0.081059, attention_score_distillation_loss: 0.015840
ETA: 0:25:09 | Epoch 41 finished. Took 40.19 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:17:33
Evaluating: accuracy: 0.6173, eval_loss: 2.5088, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5229, expected_sparsity: 0.5169, expected_sequence_sparsity: 0.8765, target_sparsity: 0.4989, step: 3300
lambda_1: -10.8754, lambda_2: 20.1152, lambda_3: 0.0000
train remain: [0.82 0.84 0.83 0.7 0.99 0.75 0.71 0.76 0.62]
infer remain: [0.79, 0.82, 0.79, 0.68, 1.0, 0.74, 0.7, 0.75, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.79, 0.65, 0.51, 0.35, 0.35, 0.26, 0.18, 0.14, 0.08]
1111111111111011111111111111101011011011100111100110110110111011111111110101011011111111011110111010
1110111011111111110110101111111111111111110111111111011111111011110111111110111101111001111011001100
1111111111111111111111111111111111111110111111010011111111101011111111101010101010101000101011100110
1111111111111111001101111111111111011101111101111001101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111011111011111111001100011101011111110011011011101000
1011111111111011111111111111011101101110111111111110110001000011111010100111011011100000100111011100
1111111101011111011111111101111011101011111011111111100011111111111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.076251, lagrangian_loss: 0.100739, attention_score_distillation_loss: 0.015197
loss: 0.350246, lagrangian_loss: 0.128922, attention_score_distillation_loss: 0.014505
----------------------------------------------------------------------
time: 2023-07-19 15:17:59
Evaluating: accuracy: 0.5993, eval_loss: 2.6515, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5346, expected_sparsity: 0.5269, expected_sequence_sparsity: 0.8791, target_sparsity: 0.5065, step: 3350
lambda_1: -11.2008, lambda_2: 20.2370, lambda_3: 0.0000
train remain: [0.82 0.83 0.82 0.69 0.99 0.75 0.7 0.76 0.62]
infer remain: [0.78, 0.81, 0.78, 0.67, 1.0, 0.74, 0.69, 0.75, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.78, 0.63, 0.49, 0.33, 0.33, 0.24, 0.17, 0.13, 0.08]
1111111111111011111111111111101011011011100111100110110110111011111111110101011011110111011110111010
1110111011111111110110101111111111111111110111111111011111111011110110111110111101111001111011001100
1111111111111111111111111111111111111110111111010011111111101011101111101010101010101000101011100110
1111111111111111001101111111111111011001111101111001101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111011111011111111001100011101011111110011011011101000
1011111111111011111111111111011101101110111111111110110001000011111010100101011011100000100111011100
1111111101011111011111111101111011101011111011111111100011111111111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.006544, lagrangian_loss: 0.143903, attention_score_distillation_loss: 0.013892
ETA: 0:24:30 | Epoch 42 finished. Took 40.78 seconds.
loss: 0.301119, lagrangian_loss: 0.108145, attention_score_distillation_loss: 0.013329
----------------------------------------------------------------------
time: 2023-07-19 15:18:24
Evaluating: accuracy: 0.639, eval_loss: 2.4096, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5359, expected_sparsity: 0.5298, expected_sequence_sparsity: 0.8799, target_sparsity: 0.5141, step: 3400
lambda_1: -11.5582, lambda_2: 20.3831, lambda_3: 0.0000
train remain: [0.81 0.82 0.81 0.69 0.99 0.75 0.7 0.76 0.62]
infer remain: [0.78, 0.81, 0.77, 0.67, 1.0, 0.73, 0.69, 0.74, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.78, 0.63, 0.49, 0.33, 0.33, 0.24, 0.16, 0.12, 0.08]
1111111111111011111111111111101011011011100111100110110110111011111111110101011011110111011110111010
1110111011111111111110101111111111111111110111111111011111111011110100111110111101111001111011001100
1111111111111111111111111111111111111110111111010011111111101011101111101010101010101000001011100110
1111111111111111001101111111111111011001111101111001101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011101000
1011111111111011111111111111011101101110111111111110110001000011111010100101011011100000100111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.014456, lagrangian_loss: 0.141687, attention_score_distillation_loss: 0.012656
loss: 0.005534, lagrangian_loss: 0.112436, attention_score_distillation_loss: 0.012044
ETA: 0:23:50 | Epoch 43 finished. Took 39.39 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:18:50
Evaluating: accuracy: 0.6173, eval_loss: 2.5429, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5445, expected_sparsity: 0.5361, expected_sequence_sparsity: 0.8815, target_sparsity: 0.5216, step: 3450
lambda_1: -11.9156, lambda_2: 20.5289, lambda_3: 0.0000
train remain: [0.8 0.82 0.8 0.68 0.99 0.74 0.69 0.76 0.62]
infer remain: [0.78, 0.8, 0.76, 0.66, 1.0, 0.73, 0.68, 0.74, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.78, 0.62, 0.47, 0.31, 0.31, 0.23, 0.16, 0.11, 0.07]
1111111111111011111111111111101011011011100111100110110110111011111111110101011011110111011110111010
1110111011111111110110101111111111111111110111111111011111111011110100111110111101111001111011001100
1111111111111111111111111111111111111110111111010011111111101011101111101010101010101000001010100110
1111111111111111001101111111111111011001111101111000101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011101000
1011111111111011111111111111011101101110111111111110110001000011111010100101011011100000000111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010111010100010000000000
loss: 0.003931, lagrangian_loss: 0.097588, attention_score_distillation_loss: 0.011439
loss: 0.007178, lagrangian_loss: 0.115671, attention_score_distillation_loss: 0.010769
----------------------------------------------------------------------
time: 2023-07-19 15:19:16
Evaluating: accuracy: 0.6029, eval_loss: 2.6406, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5523, expected_sparsity: 0.5413, expected_sequence_sparsity: 0.8828, target_sparsity: 0.5292, step: 3500
lambda_1: -12.2549, lambda_2: 20.6603, lambda_3: 0.0000
train remain: [0.79 0.81 0.79 0.67 0.99 0.74 0.69 0.75 0.62]
infer remain: [0.77, 0.8, 0.75, 0.66, 1.0, 0.73, 0.68, 0.74, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.77, 0.62, 0.46, 0.3, 0.3, 0.22, 0.15, 0.11, 0.07]
1111111111111011111111111111101011011011100111100110110110111011111111110101011011010111011110111010
1110111011111111110110101111111111111111110111111111011111111011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101110011111111101011101111101010101010101000001010100110
1111111111111111001101111111111111011001111101111000101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111011111011111111001100011101011111110011011011100000
1011111111111011111111111111011101101110111111111110110001000011111010100101011001100000100111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.005086, lagrangian_loss: 0.122999, attention_score_distillation_loss: 0.010151
ETA: 0:23:11 | Epoch 44 finished. Took 40.49 seconds.
loss: 0.011251, lagrangian_loss: 0.114003, attention_score_distillation_loss: 0.009537
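Across these blocks `target_sparsity` climbs linearly and later pins at 0.59: fitting the logged values gives roughly 1.5e-4 sparsity per step (0.4611 at step 3050, 0.5897 at step 3900, then a flat 0.59 from step 3950 on). A reconstruction under the assumption of a plain linear warmup with a cap; it matches the logged numbers to four decimals:

def target_sparsity(step: int) -> float:
    start_step, start, final = 3050, 0.4611, 0.59
    slope = (0.5897 - 0.4611) / (3900 - 3050)   # ~1.51e-4 per step, fitted to the log
    return min(final, start + (step - start_step) * slope)

print(round(target_sparsity(3500), 4))  # 0.5292, as in the step-3500 block above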
----------------------------------------------------------------------
time: 2023-07-19 15:19:41
Evaluating: accuracy: 0.6282, eval_loss: 2.494, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5609, expected_sparsity: 0.5507, expected_sequence_sparsity: 0.8852, target_sparsity: 0.5367, step: 3550
lambda_1: -12.5875, lambda_2: 20.7863, lambda_3: 0.0000
train remain: [0.79 0.81 0.78 0.67 0.98 0.73 0.69 0.75 0.61]
infer remain: [0.76, 0.79, 0.74, 0.65, 1.0, 0.72, 0.68, 0.74, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.6, 0.44, 0.29, 0.29, 0.21, 0.14, 0.1, 0.06]
1111111111111011111111111111101011011011100111100010110110111011111111111101011011010111011100111010
1110111011111111110110101111111111111111110111111111011110111011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101010011111111101011101111101010101010101000001010100110
1111111111111111001101111111111011011001111101111000101000111101111110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011100000
1011111111111011111111111111011101101110111111111110110001000011111010100101011001100000100111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.068712, lagrangian_loss: 0.109909, attention_score_distillation_loss: 0.008896
loss: 0.011009, lagrangian_loss: 0.136202, attention_score_distillation_loss: 0.008250
ETA: 0:22:30 | Epoch 45 finished. Took 39.16 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:20:07
Evaluating: accuracy: 0.639, eval_loss: 2.3263, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5648, expected_sparsity: 0.5543, expected_sequence_sparsity: 0.8861, target_sparsity: 0.5443, step: 3600
lambda_1: -12.9416, lambda_2: 20.9284, lambda_3: 0.0000
train remain: [0.78 0.81 0.77 0.66 0.98 0.73 0.68 0.75 0.61]
infer remain: [0.76, 0.79, 0.73, 0.64, 1.0, 0.72, 0.67, 0.74, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.76, 0.6, 0.44, 0.28, 0.28, 0.2, 0.14, 0.1, 0.06]
1111111111111011111111111111101011011011100111100010110110111011111111111101011011010111011100111010
1110111011111111110110101111111111111111110111111111011110111011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101010011111111001011101111101010101010101000001010100110
1111111111111111001101111111111011011001111101111000101000111101011110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011100000
1011111111111011111111111111011101101110111111111110110001000011111010100101011001100000000111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.744691, lagrangian_loss: 0.128177, attention_score_distillation_loss: 0.007626
loss: 0.006925, lagrangian_loss: 0.141114, attention_score_distillation_loss: 0.007001
----------------------------------------------------------------------
time: 2023-07-19 15:20:33
Evaluating: accuracy: 0.6643, eval_loss: 2.3173, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5661, expected_sparsity: 0.5595, expected_sequence_sparsity: 0.8875, target_sparsity: 0.5519, step: 3650
lambda_1: -13.3363, lambda_2: 21.1052, lambda_3: 0.0000
train remain: [0.77 0.81 0.76 0.65 0.98 0.73 0.68 0.75 0.61]
infer remain: [0.75, 0.78, 0.73, 0.64, 1.0, 0.72, 0.67, 0.74, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.75, 0.58, 0.43, 0.27, 0.27, 0.2, 0.13, 0.1, 0.06]
1111111111111011111111111111101011011011100111100010110110111011111111110101011011010111011100111010
1110111011111111110110101111111111111111110111111111001110111011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101010011111111001011101111101010101010101000001010100110
1111111111111111001101111111111011011001111101111000101000111101011110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011100000
1011111111111011111111111111011101101110111111111110110001000011111010100101011001100000000111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.009152, lagrangian_loss: 0.186133, attention_score_distillation_loss: 0.006354
ETA: 0:21:51 | Epoch 46 finished. Took 40.65 seconds.
loss: 0.004601, lagrangian_loss: 0.197659, attention_score_distillation_loss: 0.005716
----------------------------------------------------------------------
time: 2023-07-19 15:20:58
Evaluating: accuracy: 0.6065, eval_loss: 2.5811, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5752, expected_sparsity: 0.5664, expected_sequence_sparsity: 0.8893, target_sparsity: 0.5594, step: 3700
lambda_1: -13.8023, lambda_2: 21.3534, lambda_3: 0.0000
train remain: [0.77 0.8 0.75 0.65 0.98 0.72 0.68 0.75 0.61]
infer remain: [0.74, 0.78, 0.72, 0.63, 1.0, 0.71, 0.67, 0.73, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.58, 0.42, 0.26, 0.26, 0.19, 0.12, 0.09, 0.06]
1111111111111011111111111111101011011011100111100010110110111011111111110101010011010111011100111010
1110111011111111110110101111111111111111110111111111001110111011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101010011111111001011101111101010101000101000001010100110
1111111111111111001001111111111011011001111101111000101000111101011110111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011000000
1011111111111011111111111111011101101110111111111110110001000011111010100101011001100000000111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110110101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.413352, lagrangian_loss: 0.187738, attention_score_distillation_loss: 0.005107
loss: 0.004332, lagrangian_loss: 0.207729, attention_score_distillation_loss: 0.004463
ETA: 0:21:11 | Epoch 47 finished. Took 39.42 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:21:24
Evaluating: accuracy: 0.6065, eval_loss: 2.67, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5804, expected_sparsity: 0.5701, expected_sequence_sparsity: 0.8902, target_sparsity: 0.567, step: 3750
lambda_1: -14.2933, lambda_2: 21.6325, lambda_3: 0.0000
train remain: [0.76 0.79 0.74 0.64 0.98 0.72 0.67 0.74 0.61]
infer remain: [0.74, 0.77, 0.71, 0.63, 1.0, 0.71, 0.67, 0.73, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.74, 0.57, 0.4, 0.25, 0.25, 0.18, 0.12, 0.09, 0.05]
1111111111111011111111111111101011011011100111100010110110111011111111110101010011010111011100111010
1110111011111111110110101111111111111111110111111111001110011011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101010011111111001011101111100010101000101000001010100110
1111111111111111001101111111111011011001111101111000101000111101011100111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011000000
1011111111111011111111111111011101101110111111111110110001000011111010100101011001100000000111011100
1111111101011111011111111101111011101011111011111111100011111011111001100110110101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.003369, lagrangian_loss: 0.189280, attention_score_distillation_loss: 0.003839
loss: 0.049288, lagrangian_loss: 0.191961, attention_score_distillation_loss: 0.003212
----------------------------------------------------------------------
time: 2023-07-19 15:21:50
Evaluating: accuracy: 0.6318, eval_loss: 2.4441, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5856, expected_sparsity: 0.5762, expected_sequence_sparsity: 0.8918, target_sparsity: 0.5746, step: 3800
lambda_1: -14.8084, lambda_2: 21.9450, lambda_3: 0.0000
train remain: [0.75 0.79 0.73 0.64 0.98 0.72 0.67 0.74 0.61]
infer remain: [0.73, 0.77, 0.7, 0.62, 1.0, 0.71, 0.66, 0.73, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.73, 0.56, 0.39, 0.24, 0.24, 0.17, 0.11, 0.08, 0.05]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010111011100111010
1110111011111111110110101111111111111111110111111111001110011011110100111110111101111001111011001100
1111111111111111111111111110111111111110111101010011111111001011101111100010101000100000001010100110
1111111111111111001001111111111011011001111101111000101000111101011100111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111011111011111111001100011101011111110011011011000000
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111011100
1111011101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.003779, lagrangian_loss: 0.208335, attention_score_distillation_loss: 0.002579
ETA: 0:20:32 | Epoch 48 finished. Took 40.81 seconds.
loss: 0.149992, lagrangian_loss: 0.221737, attention_score_distillation_loss: 0.001945
----------------------------------------------------------------------
time: 2023-07-19 15:22:15
Evaluating: accuracy: 0.6318, eval_loss: 2.5037, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5882, expected_sparsity: 0.5802, expected_sequence_sparsity: 0.8929, target_sparsity: 0.5821, step: 3850
lambda_1: -15.3436, lambda_2: 22.2901, lambda_3: 0.0000
train remain: [0.74 0.78 0.72 0.63 0.98 0.71 0.66 0.74 0.61]
infer remain: [0.73, 0.76, 0.69, 0.62, 1.0, 0.7, 0.66, 0.73, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.73, 0.55, 0.38, 0.24, 0.24, 0.17, 0.11, 0.08, 0.05]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010111011100111010
1110111011111111110110101111111111111111110111111111001110011011110100111110111101111001011011001100
1111111111111111111111111110111111111110111101010011110111001011101111100010101000100000001010100110
1111111111111111001001111111111011011001111101111000101000111101011100111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001100011101011111110011011011000000
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111011100
1111011101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
loss: 0.004148, lagrangian_loss: 0.271993, attention_score_distillation_loss: 0.001310
loss: 0.276304, lagrangian_loss: 0.256949, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:22:41
Evaluating: accuracy: 0.6318, eval_loss: 2.5379, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5946, expected_sparsity: 0.588, expected_sequence_sparsity: 0.8949, target_sparsity: 0.5897, step: 3900
lambda_1: -15.9210, lambda_2: 22.7007, lambda_3: 0.0000
train remain: [0.74 0.78 0.71 0.62 0.98 0.71 0.66 0.74 0.61]
infer remain: [0.72, 0.75, 0.68, 0.61, 1.0, 0.7, 0.65, 0.73, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.54, 0.37, 0.22, 0.22, 0.16, 0.1, 0.07, 0.05]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010111011100101010
1110111011111111110110101111111111111111110111111111001110011011110100111110101101111001011011001100
1111111111111111111111111110111111111110111101010010110111001011101111100010101000100000001010100110
1111111111111111001001111111111011011001111101111000101000110101011100111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001100011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111011100
1111011101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
ETA: 0:19:53 | Epoch 49 finished. Took 40.6 seconds.
loss: 0.038694, lagrangian_loss: 0.288122, attention_score_distillation_loss: 0.000984
loss: 0.004382, lagrangian_loss: 0.260619, attention_score_distillation_loss: 0.000984
Starting to save the best model from epoch 50 (step 3950) onward
----------------------------------------------------------------------
time: 2023-07-19 15:23:07
Evaluating: accuracy: 0.6643, eval_loss: 2.2208, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.602, expected_sparsity: 0.5936, expected_sequence_sparsity: 0.8963, target_sparsity: 0.59, step: 3950
lambda_1: -16.4693, lambda_2: 23.0853, lambda_3: 0.0000
train remain: [0.73 0.77 0.7 0.62 0.97 0.71 0.66 0.74 0.61]
infer remain: [0.71, 0.75, 0.67, 0.6, 1.0, 0.7, 0.65, 0.72, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.71, 0.53, 0.36, 0.21, 0.21, 0.15, 0.1, 0.07, 0.04]
1111111111111011111111101111100011011011100111100010110110111011111111110101010011010111011100101010
1110111011111111110110101111111111111111110111111111001110011011110100111110101101111001011011001100
1111111111111111111111111110111111111100111101010010110111001011101111100010101000100000001010100110
1111111111111111001001111111101011011001111101111000101000110101011100111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001100011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111011100
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Saving the best model so far: [Epoch 50 | Step: 3950 | MACs sparsity: 0.602 | Score: 0.6643 | Loss: 2.2208]
loss: 0.422135, lagrangian_loss: 0.207163, attention_score_distillation_loss: 0.000983
loss: 0.951814, lagrangian_loss: 0.147679, attention_score_distillation_loss: 0.000985
ETA: 0:19:45 | Epoch 50 finished. Took 95.03 seconds.
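Saving only begins once the window announced at step 3950 opens; from then on every evaluation either replaces the best checkpoint or re-prints the standing record as "Best eval score so far". A minimal sketch of that bookkeeping, with the function, fields, and output path all assumed rather than taken from the script:

best = {"score": float("-inf"), "step": None, "epoch": None}

def maybe_save_best(step, epoch, score, model, save_start=3950):
    if step >= save_start and score > best["score"]:
        best.update(score=score, step=step, epoch=epoch)
        model.save_pretrained("best_model")   # hypothetical output directory
    else:
        print(f"Best eval score so far: {best['score']} @ step {best['step']} epoch {best['epoch']}")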
----------------------------------------------------------------------
time: 2023-07-19 15:24:29
Evaluating: accuracy: 0.6065, eval_loss: 2.5588, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6046, expected_sparsity: 0.5973, expected_sequence_sparsity: 0.8972, target_sparsity: 0.59, step: 4000
lambda_1: -16.8111, lambda_2: 23.2405, lambda_3: 0.0000
train remain: [0.73 0.76 0.69 0.61 0.97 0.7 0.66 0.73 0.6 ]
infer remain: [0.71, 0.74, 0.66, 0.6, 1.0, 0.69, 0.65, 0.72, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.71, 0.53, 0.35, 0.21, 0.21, 0.14, 0.09, 0.07, 0.04]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010111011100001010
1110111011111111110110101111111111111111110111111111001110011011110100111110101101011001011011001100
1011111111111111111111111110111111111110111101010010110111001011101101100010101000101000001010100010
1111111111111111001001111111101011011001111101111000101000110101011100111111010101100100010000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001100011101011111110011011011000000
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010100
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.336843, lagrangian_loss: 0.106470, attention_score_distillation_loss: 0.000985
loss: 0.005895, lagrangian_loss: 0.057680, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:24:54
Evaluating: accuracy: 0.6498, eval_loss: 2.2217, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6123, expected_sparsity: 0.6042, expected_sequence_sparsity: 0.899, target_sparsity: 0.59, step: 4050
lambda_1: -16.9701, lambda_2: 23.2784, lambda_3: 0.0000
train remain: [0.72 0.76 0.69 0.6 0.97 0.7 0.65 0.73 0.6 ]
infer remain: [0.7, 0.73, 0.65, 0.59, 1.0, 0.69, 0.65, 0.72, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.7, 0.51, 0.33, 0.2, 0.2, 0.14, 0.09, 0.06, 0.04]
1111111111111011111111101111100011011011100111100010110110111011111111110101010011010111011100001010
1110111011111111110110101111111111111111110111111111001110011011110100111110100101011001011011001100
1011111111111111111111111110111111111110111101010010110111001011101101100010101000101000001010000010
1111111111111111001001111111101011011001111101111000101000110101011100111111010101100100000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001100011101011111110011011011000000
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010100
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.431041, lagrangian_loss: 0.012984, attention_score_distillation_loss: 0.000986
ETA: 0:19:04 | Epoch 51 finished. Took 40.73 seconds.
loss: 0.450005, lagrangian_loss: -0.025127, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:25:20
Evaluating: accuracy: 0.6282, eval_loss: 2.4345, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6149, expected_sparsity: 0.6068, expected_sequence_sparsity: 0.8997, target_sparsity: 0.59, step: 4100
lambda_1: -16.9393, lambda_2: 23.2879, lambda_3: 0.0000
train remain: [0.71 0.75 0.67 0.6 0.97 0.7 0.65 0.73 0.6 ]
infer remain: [0.7, 0.73, 0.64, 0.58, 1.0, 0.69, 0.64, 0.72, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.7, 0.51, 0.33, 0.19, 0.19, 0.13, 0.08, 0.06, 0.04]
1111111111111011111111101111100011011011100111100010110110111011111111110101010011010111011100001010
1110111011111111110110101111111111111111110111111111001110011011110100111110100101011001011011001100
1011111111111111111111111110111111111100111101010010110111001011101101100010101000101000001010000010
1101111111111111001001111111101011011001111101111000101000110101011100111111010101100100000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001100011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010100
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.005654, lagrangian_loss: -0.076509, attention_score_distillation_loss: 0.000987
loss: 0.266467, lagrangian_loss: -0.118968, attention_score_distillation_loss: 0.000986
ETA: 0:18:22 | Epoch 52 finished. Took 39.02 seconds.
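From step 4050 onward lambda_1 reverses direction, climbing from -16.97 back toward zero, and `lagrangian_loss` turns negative. That is what an adversarial multiplier update predicts once expected sparsity overshoots the now-capped 0.59 target; a sketch of the ascent step, assuming the CoFi-style update rule:

def update_lambda_1(lambda_1: float, s: float, t: float, eta: float = 0.01) -> float:
    # gradient ascent on lambda_1 * (s - t): the gradient w.r.t. lambda_1 is (s - t)
    return lambda_1 + eta * (s - t)

# With s > t = 0.59, lambda_1 rises toward zero; while it is still negative, the
# linear term keeps the logged penalty below zero. (eta is an assumed rate.)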
----------------------------------------------------------------------
time: 2023-07-19 15:25:45
Evaluating: accuracy: 0.6065, eval_loss: 2.6424, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6197, expected_sparsity: 0.6115, expected_sequence_sparsity: 0.9009, target_sparsity: 0.59, step: 4150
lambda_1: -16.6850, lambda_2: 23.3727, lambda_3: 0.0000
train remain: [0.71 0.74 0.67 0.59 0.96 0.69 0.65 0.73 0.6 ]
infer remain: [0.69, 0.72, 0.64, 0.58, 1.0, 0.68, 0.64, 0.72, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.69, 0.5, 0.32, 0.18, 0.18, 0.13, 0.08, 0.06, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010011010101011100001010
1110111011111111110110100111111111111111110111111111001110011011110100111110100101011001011011001100
1011111111111111111111111110111111111100111101010010111111001011101101100010001000101000001010000010
1101111111111111001001111111101011011001111101111000101000110101011100111111010101100100000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010100
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.012612, lagrangian_loss: -0.160934, attention_score_distillation_loss: 0.000975
loss: 0.478241, lagrangian_loss: -0.205715, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:26:10
Evaluating: accuracy: 0.6318, eval_loss: 2.3875, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6287, expected_sparsity: 0.6179, expected_sequence_sparsity: 0.9026, target_sparsity: 0.59, step: 4200
lambda_1: -16.2349, lambda_2: 23.6216, lambda_3: 0.0000
train remain: [0.7 0.74 0.66 0.58 0.96 0.69 0.64 0.73 0.6 ]
infer remain: [0.68, 0.71, 0.63, 0.57, 1.0, 0.68, 0.64, 0.72, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.68, 0.48, 0.3, 0.17, 0.17, 0.12, 0.08, 0.05, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010001010101011100001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100101011001011011001100
1011111111111111111111111110111111111100111101010010110111001011101101100010001000101000001010000010
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100100000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010100
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.005949, lagrangian_loss: -0.250325, attention_score_distillation_loss: 0.000984
ETA: 0:17:41 | Epoch 53 finished. Took 40.55 seconds.
loss: 0.025485, lagrangian_loss: -0.296913, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:26:36
Evaluating: accuracy: 0.6245, eval_loss: 2.4973, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6313, expected_sparsity: 0.6204, expected_sequence_sparsity: 0.9032, target_sparsity: 0.59, step: 4250
lambda_1: -15.5835, lambda_2: 24.1509, lambda_3: 0.0000
train remain: [0.69 0.73 0.65 0.58 0.96 0.69 0.64 0.72 0.6 ]
infer remain: [0.68, 0.71, 0.62, 0.56, 1.0, 0.68, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.68, 0.48, 0.3, 0.17, 0.17, 0.11, 0.07, 0.05, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010001010101011100001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100101011001011011001100
1011111111111111111111111110111111111100111101010010110111001011101101100010001000100000001010000010
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.003920, lagrangian_loss: -0.309019, attention_score_distillation_loss: 0.000987
loss: 0.159576, lagrangian_loss: -0.334167, attention_score_distillation_loss: 0.000987
ETA: 0:17:00 | Epoch 54 finished. Took 39.46 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:27:02
Evaluating: accuracy: 0.6606, eval_loss: 2.3501, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6326, expected_sparsity: 0.6245, expected_sequence_sparsity: 0.9043, target_sparsity: 0.59, step: 4300
lambda_1: -14.8043, lambda_2: 24.9288, lambda_3: 0.0000
train remain: [0.69 0.73 0.65 0.57 0.96 0.69 0.64 0.72 0.6 ]
infer remain: [0.67, 0.7, 0.62, 0.56, 1.0, 0.68, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.67, 0.47, 0.29, 0.16, 0.16, 0.11, 0.07, 0.05, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010001010101011000001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100001011001011011001100
1011111111111111111111111110111111111100111101010010110111001011101101100010001000100000001010000010
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.003980, lagrangian_loss: -0.337197, attention_score_distillation_loss: 0.000985
loss: 0.914364, lagrangian_loss: -0.357105, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:27:28
Evaluating: accuracy: 0.6318, eval_loss: 2.4607, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6326, expected_sparsity: 0.6257, expected_sequence_sparsity: 0.9046, target_sparsity: 0.59, step: 4350
lambda_1: -13.9215, lambda_2: 25.9556, lambda_3: 0.0000
train remain: [0.68 0.72 0.64 0.57 0.95 0.68 0.64 0.72 0.6 ]
infer remain: [0.67, 0.7, 0.61, 0.56, 1.0, 0.68, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.67, 0.47, 0.29, 0.16, 0.16, 0.11, 0.07, 0.05, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010001010101011000001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100001011001011011001100
1011111111111111111111111110111111111100111101010010110111001001101101100010001000100000001010000010
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.004179, lagrangian_loss: -0.348407, attention_score_distillation_loss: 0.000986
ETA: 0:16:19 | Epoch 55 finished. Took 40.31 seconds.
loss: 0.003455, lagrangian_loss: -0.360717, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:27:53
Evaluating: accuracy: 0.6498, eval_loss: 2.3405, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6404, expected_sparsity: 0.6308, expected_sequence_sparsity: 0.9059, target_sparsity: 0.59, step: 4400
lambda_1: -12.9876, lambda_2: 27.1190, lambda_3: 0.0000
train remain: [0.68 0.72 0.64 0.56 0.95 0.68 0.64 0.72 0.6 ]
infer remain: [0.66, 0.69, 0.61, 0.55, 1.0, 0.67, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.66, 0.46, 0.28, 0.15, 0.15, 0.1, 0.06, 0.05, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100001001001011011001100
1011111111111111111111111110111111111100111101010010110111001001101101100010001000100000001010000010
1101111111111111001001111111101011011001111101111000100000110101010100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.004634, lagrangian_loss: -0.363920, attention_score_distillation_loss: 0.000985
loss: 0.540920, lagrangian_loss: -0.363997, attention_score_distillation_loss: 0.000985
ETA: 0:15:37 | Epoch 56 finished. Took 39.62 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:28:19
Evaluating: accuracy: 0.6137, eval_loss: 2.4352, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6404, expected_sparsity: 0.6308, expected_sequence_sparsity: 0.9059, target_sparsity: 0.59, step: 4450
lambda_1: -12.0277, lambda_2: 28.3537, lambda_3: 0.0000
train remain: [0.67 0.71 0.63 0.56 0.95 0.68 0.64 0.72 0.6 ]
infer remain: [0.66, 0.69, 0.61, 0.55, 1.0, 0.67, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.66, 0.46, 0.28, 0.15, 0.15, 0.1, 0.06, 0.05, 0.03]
1111111111111011111111101111100011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100001001001011011001100
1011111111111111111111111110111111111100111101010010110111001001101101100010001000100000001010000010
1101111111111111001001111111101011011001111101111000100000110101010100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.128388, lagrangian_loss: -0.345102, attention_score_distillation_loss: 0.000982
loss: 0.037993, lagrangian_loss: -0.356329, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:28:45
Evaluating: accuracy: 0.6173, eval_loss: 2.4553, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.643, expected_sparsity: 0.6344, expected_sequence_sparsity: 0.9068, target_sparsity: 0.59, step: 4500
lambda_1: -11.0517, lambda_2: 29.6301, lambda_3: 0.0000
train remain: [0.67 0.71 0.63 0.56 0.95 0.68 0.63 0.72 0.6 ]
infer remain: [0.65, 0.69, 0.6, 0.55, 1.0, 0.67, 0.63, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.45, 0.27, 0.15, 0.15, 0.1, 0.06, 0.04, 0.03]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100111110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001101101100010001000100000001010000010
1101111111111111001001111111101011011001101101111000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.003925, lagrangian_loss: -0.350756, attention_score_distillation_loss: 0.000985
ETA: 0:14:56 | Epoch 57 finished. Took 40.8 seconds.
loss: 0.009974, lagrangian_loss: -0.340278, attention_score_distillation_loss: 0.000983
----------------------------------------------------------------------
time: 2023-07-19 15:29:11
Evaluating: accuracy: 0.6318, eval_loss: 2.4038, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6468, expected_sparsity: 0.6367, expected_sequence_sparsity: 0.9074, target_sparsity: 0.59, step: 4550
lambda_1: -10.0790, lambda_2: 30.8959, lambda_3: 0.0000
train remain: [0.66 0.71 0.62 0.55 0.94 0.68 0.63 0.72 0.6 ]
infer remain: [0.65, 0.68, 0.6, 0.54, 1.0, 0.67, 0.63, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.27, 0.14, 0.14, 0.1, 0.06, 0.04, 0.03]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001101101100010001000100000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.071053, lagrangian_loss: -0.338397, attention_score_distillation_loss: 0.000986
loss: 0.004020, lagrangian_loss: -0.309773, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:29:37
Evaluating: accuracy: 0.6245, eval_loss: 2.4036, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6468, expected_sparsity: 0.6369, expected_sequence_sparsity: 0.9074, target_sparsity: 0.59, step: 4600
lambda_1: -9.1241, lambda_2: 32.1191, lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.94 0.68 0.63 0.72 0.59]
infer remain: [0.65, 0.68, 0.6, 0.54, 1.0, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.27, 0.14, 0.14, 0.1, 0.06, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000101000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011100001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.011813, lagrangian_loss: -0.303010, attention_score_distillation_loss: 0.000985
ETA: 0:14:16 | Epoch 58 finished. Took 40.74 seconds.
loss: 0.389025, lagrangian_loss: -0.289140, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:30:02
Evaluating: accuracy: 0.6173, eval_loss: 2.4259, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6494, expected_sparsity: 0.6379, expected_sequence_sparsity: 0.9077, target_sparsity: 0.59, step: 4650
lambda_1: -8.1995, lambda_2: 33.2752, lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.94 0.68 0.63 0.72 0.59]
infer remain: [0.65, 0.68, 0.59, 0.54, 1.0, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.26, 0.14, 0.14, 0.09, 0.06, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011100001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.005033, lagrangian_loss: -0.275261, attention_score_distillation_loss: 0.000986
loss: 0.024751, lagrangian_loss: -0.249146, attention_score_distillation_loss: 0.000983
ETA: 0:13:34 | Epoch 59 finished. Took 39.4 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:30:28
Evaluating: accuracy: 0.6173, eval_loss: 2.492, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6494, expected_sparsity: 0.6379, expected_sequence_sparsity: 0.9077, target_sparsity: 0.59, step: 4700
lambda_1: -7.3059, lambda_2: 34.3695, lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.93 0.68 0.63 0.72 0.59]
infer remain: [0.65, 0.68, 0.59, 0.54, 1.0, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.26, 0.14, 0.14, 0.09, 0.06, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100101111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011100001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.006469, lagrangian_loss: -0.239501, attention_score_distillation_loss: 0.000986
loss: 0.007700, lagrangian_loss: -0.224145, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:30:54
Evaluating: accuracy: 0.6137, eval_loss: 2.4349, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6442, expected_sequence_sparsity: 0.9093, target_sparsity: 0.59, step: 4750
lambda_1: -6.4449, lambda_2: 35.4011, lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.93 0.67 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011100001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.483426, lagrangian_loss: -0.201096, attention_score_distillation_loss: 0.000986
ETA: 0:12:54 | Epoch 60 finished. Took 40.94 seconds.
loss: 0.333813, lagrangian_loss: -0.191403, attention_score_distillation_loss: 0.000986
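Each 100-character row printed after an evaluation appears to be the hard keep/drop mask for one prune location over a 100-token sequence: a 1 keeps the token at that layer, a 0 drops it, and the fraction of 1s in a row tracks the corresponding "infer remain" entry. (The all-1s row lines up with whichever location is reported False in token_prune_loc, i.e. remain = 1.0, as in the step-4700 block.) A small sketch for reading the rows back, under that interpretation:

# Two mask rows copied verbatim from the step-4750 block above: the first
# prune location (logged infer remain 0.64) and the fifth (logged 0.87).
rows = {
    "loc 0": "1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010",
    "loc 4": "1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000",
}
for name, row in rows.items():
    print(name, "keep ratio:", row.count("1") / len(row))
# -> loc 0 keep ratio: 0.64
# -> loc 4 keep ratio: 0.87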
----------------------------------------------------------------------
time: 2023-07-19 15:31:20
Evaluating: accuracy: 0.6173, eval_loss: 2.5087, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6442, expected_sequence_sparsity: 0.9093, target_sparsity: 0.59, step: 4800
lambda_1: -5.6111, lambda_2: 36.3844 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.93 0.67 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011110111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011100001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.002886, lagrangian_loss: -0.173184, attention_score_distillation_loss: 0.000986
loss: 0.002963, lagrangian_loss: -0.152122, attention_score_distillation_loss: 0.000986
ETA: 0:12:13 | Epoch 61 finished. Took 38.96 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:31:45
Evaluating: accuracy: 0.6498, eval_loss: 2.2829, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6442, expected_sequence_sparsity: 0.9093, target_sparsity: 0.59, step: 4850
lambda_1: -4.8085, lambda_2: 37.3120 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.54 0.93 0.67 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.003775, lagrangian_loss: -0.137805, attention_score_distillation_loss: 0.000985
loss: 0.453928, lagrangian_loss: -0.117485, attention_score_distillation_loss: 0.000986
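The gap between "train remain" (e.g. 0.93 at the fifth location) and "infer remain" (0.87, or 1.0 when the location is disabled) is the usual soft-versus-hard gate split in L0-style pruning: training tracks the expected value of a relaxed stochastic gate, while inference thresholds it into the 0/1 masks printed above. The log does not confirm the gate parameterization, so the following is only an illustrative sketch of that thresholding:

import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=100)                # illustrative per-token gate logits

train_gate = 1 / (1 + np.exp(-scores))       # soft gate used during training
infer_mask = (train_gate > 0.5).astype(int)  # hard 0/1 mask used at inference

print("train remain:", round(float(train_gate.mean()), 2))  # expected keep ratio
print("infer remain:", round(float(infer_mask.mean()), 2))  # realized keep ratio
print("".join(map(str, infer_mask)))         # a 100-char row like those above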
----------------------------------------------------------------------
time: 2023-07-19 15:32:11
Evaluating: accuracy: 0.6245, eval_loss: 2.5208, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.652, expected_sparsity: 0.6448, expected_sequence_sparsity: 0.9095, target_sparsity: 0.59, step: 4900
lambda_1: -4.0345, lambda_2: 38.1898 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.54 0.93 0.67 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.53, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.03, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.013513, lagrangian_loss: -0.100817, attention_score_distillation_loss: 0.000985
ETA: 0:11:32 | Epoch 62 finished. Took 41.15 seconds.
loss: 0.003451, lagrangian_loss: -0.085410, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:32:37
Evaluating: accuracy: 0.5957, eval_loss: 2.5711, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.652, expected_sparsity: 0.6448, expected_sequence_sparsity: 0.9095, target_sparsity: 0.59, step: 4950
lambda_1: -3.2855, lambda_2: 39.0262 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.54 0.93 0.68 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.53, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.03, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.320035, lagrangian_loss: -0.069144, attention_score_distillation_loss: 0.000986
loss: 0.002735, lagrangian_loss: -0.052947, attention_score_distillation_loss: 0.000985
ETA: 0:10:51 | Epoch 63 finished. Took 39.15 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:33:02
Evaluating: accuracy: 0.6173, eval_loss: 2.3944, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.652, expected_sparsity: 0.6448, expected_sequence_sparsity: 0.9095, target_sparsity: 0.59, step: 5000
lambda_1: -2.5592, lambda_2: 39.8266 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.54 0.93 0.68 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.53, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.03, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.003159, lagrangian_loss: -0.037125, attention_score_distillation_loss: 0.000986
loss: 0.006786, lagrangian_loss: -0.021828, attention_score_distillation_loss: 0.000987
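The lagrangian_loss column behaves like the standard Lagrangian relaxation used to steer expected sparsity s toward the target t: penalty = lambda_1 * (s - t) + lambda_2 * (s - t)^2. It is negative while lambda_1 < 0 and flips sign a few blocks below as lambda_1 crosses zero. A check against this step-5000 block (the formula itself is an assumption, since the log never prints it):

lambda_1, lambda_2 = -2.5592, 39.8266   # from the step-5000 block above
expected, target = 0.6448, 0.59         # expected_sparsity, target_sparsity

gap = expected - target
penalty = lambda_1 * gap + lambda_2 * gap ** 2
print(round(penalty, 4))
# -> -0.0206, in line with the logged lagrangian_loss values around this
#    step (-0.0371, -0.0218); the multipliers and the expected sparsity
#    drift slightly between printed steps, hence the small mismatch.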
----------------------------------------------------------------------
time: 2023-07-19 15:33:28
Evaluating: accuracy: 0.6137, eval_loss: 2.45, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6442, expected_sequence_sparsity: 0.9093, target_sparsity: 0.59, step: 5050
lambda_1: -1.8595, lambda_2: 40.5815 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.93 0.68 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000100000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100101111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.008101, lagrangian_loss: -0.011321, attention_score_distillation_loss: 0.000981
ETA: 0:10:10 | Epoch 64 finished. Took 40.78 seconds.
loss: 0.443700, lagrangian_loss: 0.007804, attention_score_distillation_loss: 0.000987
----------------------------------------------------------------------
time: 2023-07-19 15:33:53
Evaluating: accuracy: 0.6209, eval_loss: 2.407, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6442, expected_sequence_sparsity: 0.9093, target_sparsity: 0.59, step: 5100
lambda_1: -1.1800, lambda_2: 41.3047 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.93 0.68 0.63 0.71 0.59]
infer remain: [0.64, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.64, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111110110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000001000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.287923, lagrangian_loss: 0.021347, attention_score_distillation_loss: 0.000987
loss: 0.003075, lagrangian_loss: 0.028340, attention_score_distillation_loss: 0.000984
ETA: 0:09:29 | Epoch 65 finished. Took 39.07 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:34:19
Evaluating: accuracy: 0.6282, eval_loss: 2.315, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6419, expected_sequence_sparsity: 0.9087, target_sparsity: 0.59, step: 5150
lambda_1: -0.5346, lambda_2: 41.9669 lambda_3: 0.0000
train remain: [0.66 0.7 0.62 0.55 0.93 0.68 0.63 0.71 0.59]
infer remain: [0.65, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000001000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.280213, lagrangian_loss: 0.041546, attention_score_distillation_loss: 0.000984
loss: 0.003004, lagrangian_loss: 0.051726, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:34:45
Evaluating: accuracy: 0.6318, eval_loss: 2.277, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6507, expected_sparsity: 0.6419, expected_sequence_sparsity: 0.9087, target_sparsity: 0.59, step: 5200
lambda_1: 0.0847, lambda_2: 42.5851 lambda_3: 0.0000
train remain: [0.67 0.71 0.62 0.55 0.93 0.68 0.63 0.71 0.59]
infer remain: [0.65, 0.68, 0.59, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.26, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000001000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
loss: 0.003429, lagrangian_loss: 0.058079, attention_score_distillation_loss: 0.000978
loss: 0.003739, lagrangian_loss: 0.072819, attention_score_distillation_loss: 0.000984
ETA: 0:08:48 | Epoch 66 finished. Took 40.95 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:35:11
Evaluating: accuracy: 0.6823, eval_loss: 2.1643, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6494, expected_sparsity: 0.6409, expected_sequence_sparsity: 0.9085, target_sparsity: 0.59, step: 5250
lambda_1: 0.6754, lambda_2: 43.1551 lambda_3: 0.0000
train remain: [0.67 0.71 0.63 0.55 0.93 0.68 0.63 0.72 0.6 ]
infer remain: [0.65, 0.68, 0.6, 0.54, 0.87, 0.67, 0.62, 0.71, 0.59]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.27, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000101000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111011111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6643 @ step 3950 epoch 50.64
Saving the best model so far: [Epoch 67 | Step: 5250 | MACs sparsity: 0.6494 | Score: 0.6823 | Loss: 2.1643]
loss: 0.004336, lagrangian_loss: 0.076222, attention_score_distillation_loss: 0.000984
loss: 0.003434, lagrangian_loss: 0.093380, attention_score_distillation_loss: 0.000985
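The "Best eval score so far" / "Saving the best model so far" pair above is plain best-checkpoint bookkeeping: each evaluation is compared against the running best, and the model is written out only on improvement (here 0.6823 finally beats the 0.6643 that had held since step 3950). A minimal sketch of that logic, with invented names; the real script's save routine is not shown in the log:

best_score, best_step = float("-inf"), None

def on_evaluate(score, step, epoch, save_fn):
    """Call after each evaluation; saves only when the score improves."""
    global best_score, best_step
    if score > best_score:
        best_score, best_step = score, step
        save_fn()  # e.g. trainer.save_model(output_dir) -- hypothetical
        print(f"Saving the best model so far: [Epoch {epoch} | Step: {step} | Score: {score}]")
    print(f"Best eval score so far: {best_score} @ step {best_step}")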
----------------------------------------------------------------------
time: 2023-07-19 15:36:10
Evaluating: accuracy: 0.6137, eval_loss: 2.5237, token_prune_loc: [True, True, True, True, True, True, True, True, True], macs_sparsity: 0.6494, expected_sparsity: 0.6408, expected_sequence_sparsity: 0.9085, target_sparsity: 0.59, step: 5300
lambda_1: 1.2383, lambda_2: 43.6787 lambda_3: 0.0000
train remain: [0.67 0.71 0.63 0.55 0.94 0.68 0.63 0.72 0.6 ]
infer remain: [0.65, 0.68, 0.6, 0.54, 0.87, 0.67, 0.62, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.44, 0.27, 0.14, 0.12, 0.08, 0.05, 0.04, 0.02]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100001001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000101000001010000010
1101111111111111001001111111101011011001101101011000100000110101011100111111010101100000000000000000
1111111111111111111111111101111111111111111101111001111111111111011111111111111101111111111010001000
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110100001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.003758, lagrangian_loss: 0.100397, attention_score_distillation_loss: 0.000987
ETA: 0:08:14 | Epoch 67 finished. Took 74.39 seconds.
loss: 0.004088, lagrangian_loss: 0.090482, attention_score_distillation_loss: 0.000982
----------------------------------------------------------------------
time: 2023-07-19 15:36:36
Evaluating: accuracy: 0.6101, eval_loss: 2.4932, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.643, expected_sparsity: 0.6343, expected_sequence_sparsity: 0.9068, target_sparsity: 0.59, step: 5350
lambda_1: 1.7714, lambda_2: 44.1531 lambda_3: 0.0000
train remain: [0.67 0.71 0.63 0.56 0.94 0.68 0.63 0.72 0.6 ]
infer remain: [0.65, 0.69, 0.6, 0.55, 1.0, 0.67, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.65, 0.45, 0.27, 0.15, 0.15, 0.1, 0.06, 0.04, 0.03]
1111111111111011111111101111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100101001001011011001100
1010111111111111111111111110111111111100111101010010110111001001001101100010001000101000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.134332, lagrangian_loss: 0.109142, attention_score_distillation_loss: 0.000986
loss: 0.003198, lagrangian_loss: 0.106465, attention_score_distillation_loss: 0.000986
ETA: 0:07:32 | Epoch 68 finished. Took 39.27 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:37:02
Evaluating: accuracy: 0.6137, eval_loss: 2.5805, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6404, expected_sparsity: 0.6308, expected_sequence_sparsity: 0.9059, target_sparsity: 0.59, step: 5400
lambda_1: 2.2698, lambda_2: 44.5713 lambda_3: 0.0000
train remain: [0.68 0.72 0.64 0.56 0.94 0.68 0.64 0.72 0.6 ]
infer remain: [0.66, 0.69, 0.61, 0.55, 1.0, 0.67, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.66, 0.46, 0.28, 0.15, 0.15, 0.1, 0.06, 0.05, 0.03]
1111111111111011111111111111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100101001001011011001100
1010111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.004231, lagrangian_loss: 0.106258, attention_score_distillation_loss: 0.000986
loss: 0.003570, lagrangian_loss: 0.106881, attention_score_distillation_loss: 0.000984
----------------------------------------------------------------------
time: 2023-07-19 15:37:27
Evaluating: accuracy: 0.6101, eval_loss: 2.5148, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6404, expected_sparsity: 0.6308, expected_sequence_sparsity: 0.9059, target_sparsity: 0.59, step: 5450
lambda_1: 2.7270, lambda_2: 44.9251 lambda_3: 0.0000
train remain: [0.68 0.72 0.64 0.56 0.94 0.68 0.64 0.72 0.6 ]
infer remain: [0.66, 0.69, 0.61, 0.55, 1.0, 0.67, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.66, 0.46, 0.28, 0.15, 0.15, 0.1, 0.06, 0.05, 0.03]
1111111111111011111111111111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100101001001011011001100
1010111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000010
1101111111111111001001111111101011011001111101011000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.158236, lagrangian_loss: 0.109893, attention_score_distillation_loss: 0.000985
ETA: 0:06:51 | Epoch 69 finished. Took 40.79 seconds.
loss: 0.001853, lagrangian_loss: 0.098316, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:37:53
Evaluating: accuracy: 0.6426, eval_loss: 2.2234, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6352, expected_sparsity: 0.627, expected_sequence_sparsity: 0.9049, target_sparsity: 0.59, step: 5500
lambda_1: 3.1372, lambda_2: 45.2110 lambda_3: 0.0000
train remain: [0.68 0.73 0.65 0.57 0.95 0.69 0.64 0.72 0.6 ]
infer remain: [0.66, 0.7, 0.62, 0.56, 1.0, 0.68, 0.63, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.66, 0.46, 0.29, 0.16, 0.16, 0.11, 0.07, 0.05, 0.03]
1111111111111011111111111111000011011011100111100010110110111011111111110101010001010101010000001010
1110111011111111110110100011110111111111110111111111001110011011110100111110100101001001011011001101
1010111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100000000000000000
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.003346, lagrangian_loss: 0.099365, attention_score_distillation_loss: 0.000986
loss: 0.002418, lagrangian_loss: 0.093740, attention_score_distillation_loss: 0.000984
ETA: 0:06:10 | Epoch 70 finished. Took 39.22 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:38:19
Evaluating: accuracy: 0.6498, eval_loss: 2.2467, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6326, expected_sparsity: 0.6243, expected_sequence_sparsity: 0.9042, target_sparsity: 0.59, step: 5550
lambda_1: 3.4941, lambda_2: 45.4272 lambda_3: 0.0000
train remain: [0.69 0.74 0.66 0.57 0.95 0.69 0.64 0.72 0.61]
infer remain: [0.67, 0.7, 0.62, 0.56, 1.0, 0.68, 0.64, 0.71, 0.6]
layerwise remain: [1.0, 1.0, 1.0, 0.67, 0.47, 0.29, 0.16, 0.16, 0.11, 0.07, 0.05, 0.03]
1111111111111011111111111111000011011011100111100010110110111011111111110101010001010101011000001010
1110111011111111110110100011111111111111110111111111001110011011110100111110100001001001011011001101
1010111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000100000110101010100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010101011111111001000011101010111110011011011000000
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010100
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100000
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.001547, lagrangian_loss: 0.092779, attention_score_distillation_loss: 0.000986
loss: 0.002488, lagrangian_loss: 0.083619, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:38:45
Evaluating: accuracy: 0.657, eval_loss: 2.304, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.63, expected_sparsity: 0.6204, expected_sequence_sparsity: 0.9032, target_sparsity: 0.59, step: 5600
lambda_1: 3.8006, lambda_2: 45.5856 lambda_3: 0.0000
train remain: [0.69 0.74 0.66 0.58 0.95 0.69 0.65 0.72 0.61]
infer remain: [0.67, 0.71, 0.63, 0.57, 1.0, 0.68, 0.64, 0.72, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.67, 0.48, 0.3, 0.17, 0.17, 0.12, 0.07, 0.05, 0.03]
1111111111111011111111111111000011011011100111100010110110111011111111110101010001010101011000001010
1110111011111111110110100011111111111111110111111111011110011011110100111110100001001001011011001101
1011111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010100011111111001000011101010111110011011011000001
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010101
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.004084, lagrangian_loss: 0.082704, attention_score_distillation_loss: 0.000987
ETA: 0:05:28 | Epoch 71 finished. Took 40.69 seconds.
loss: 0.007104, lagrangian_loss: 0.071440, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:39:09
Evaluating: accuracy: 0.6426, eval_loss: 2.3328, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6287, expected_sparsity: 0.6173, expected_sequence_sparsity: 0.9024, target_sparsity: 0.59, step: 5650
lambda_1: 4.0550, lambda_2: 45.6941 lambda_3: 0.0000
train remain: [0.7 0.74 0.67 0.58 0.96 0.7 0.65 0.73 0.61]
infer remain: [0.68, 0.71, 0.63, 0.57, 1.0, 0.69, 0.65, 0.72, 0.61]
layerwise remain: [1.0, 1.0, 1.0, 0.68, 0.48, 0.3, 0.17, 0.17, 0.12, 0.08, 0.06, 0.03]
1111111111111011111111111111100011011011100111100010110110111011111111110101010001010101011000001010
1110111011111111110110100011111111111111110111111111001110011011110100111110100001011001011011001101
1011111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000100000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010101011111111001000011101010111110011011011000001
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100000
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000000
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.003086, lagrangian_loss: 0.056156, attention_score_distillation_loss: 0.000984
loss: 0.004400, lagrangian_loss: 0.049035, attention_score_distillation_loss: 0.000985
ETA: 0:04:47 | Epoch 72 finished. Took 38.71 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:39:35
Evaluating: accuracy: 0.6462, eval_loss: 2.3567, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.624, expected_sparsity: 0.6163, expected_sequence_sparsity: 0.9021, target_sparsity: 0.59, step: 5700
lambda_1: 4.2374, lambda_2: 45.7497 lambda_3: 0.0000
train remain: [0.71 0.75 0.67 0.59 0.96 0.7 0.66 0.73 0.62]
infer remain: [0.68, 0.71, 0.63, 0.58, 1.0, 0.69, 0.65, 0.72, 0.62]
layerwise remain: [1.0, 1.0, 1.0, 0.68, 0.48, 0.3, 0.18, 0.18, 0.12, 0.08, 0.06, 0.04]
1111111111111011111111111111000011011011100111100010110110111011111111110101010001010101011100001010
1110111011111111110110100011111111111111110111111111001110011011110100111110100001011001011011001101
1011111111111111111111111110111111111100111101010010110111001001101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011111111111111111111001001111010101011111111001000011101010111110011011011000001
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010111
1111011101011111011111111101111011101011111011111111100011111011101001100110110101101111101110100001
0000010111111111011111111110110111101011111111111000111100101111010101011011010011010100010000000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.334937, lagrangian_loss: 0.035039, attention_score_distillation_loss: 0.000984
loss: 0.005062, lagrangian_loss: 0.031952, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:40:01
Evaluating: accuracy: 0.6354, eval_loss: 2.3583, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6175, expected_sparsity: 0.6105, expected_sequence_sparsity: 0.9006, target_sparsity: 0.59, step: 5750
lambda_1: 4.3475, lambda_2: 45.7704 lambda_3: 0.0000
train remain: [0.71 0.75 0.68 0.59 0.96 0.7 0.66 0.73 0.63]
infer remain: [0.69, 0.72, 0.64, 0.58, 1.0, 0.69, 0.65, 0.73, 0.63]
layerwise remain: [1.0, 1.0, 1.0, 0.69, 0.5, 0.32, 0.18, 0.18, 0.13, 0.08, 0.06, 0.04]
1111111111111011111111111111100011011011100111100010110110111011111111110101010001010101011000001011
1110111011111111110110100011111111111111110111111111011110011011110100111110100001011001011011001101
1011111111111111111111111110111111111100111101010010110111001011101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000001
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000110010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100001
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010000000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.003359, lagrangian_loss: 0.018613, attention_score_distillation_loss: 0.000986
ETA: 0:04:06 | Epoch 73 finished. Took 40.27 seconds.
loss: 0.006600, lagrangian_loss: 0.009017, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:40:27
Evaluating: accuracy: 0.6209, eval_loss: 2.4687, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6175, expected_sparsity: 0.6098, expected_sequence_sparsity: 0.9005, target_sparsity: 0.59, step: 5800
lambda_1: 4.3975, lambda_2: 45.7758 lambda_3: 0.0000
train remain: [0.72 0.76 0.68 0.59 0.96 0.71 0.66 0.74 0.64]
infer remain: [0.69, 0.72, 0.64, 0.58, 1.0, 0.7, 0.66, 0.73, 0.64]
layerwise remain: [1.0, 1.0, 1.0, 0.69, 0.5, 0.32, 0.18, 0.18, 0.13, 0.09, 0.06, 0.04]
1111111111111011111111111111100011011011100111100010110110111011111111110101010001010101011000001011
1110111011111111110110100111110111111111110111111111011110011011110100111110100001011001011011001101
1011111111111111111111111110111111111100111101010010110111001011101101100010001000101000001010000011
1101111111111111001001111111101011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001000011101011111110011011011000001
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100001
0000010111111111011111111110110111101111111111111000111100101111010101011011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.004027, lagrangian_loss: -0.008580, attention_score_distillation_loss: 0.000984
loss: 0.002055, lagrangian_loss: -0.008641, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:40:52
Evaluating: accuracy: 0.6498, eval_loss: 2.3617, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6162, expected_sparsity: 0.606, expected_sequence_sparsity: 0.8995, target_sparsity: 0.59, step: 5850
lambda_1: 4.3736, lambda_2: 45.7781 lambda_3: 0.0000
train remain: [0.72 0.77 0.69 0.6 0.97 0.71 0.67 0.74 0.65]
infer remain: [0.7, 0.72, 0.64, 0.59, 1.0, 0.7, 0.66, 0.73, 0.65]
layerwise remain: [1.0, 1.0, 1.0, 0.7, 0.5, 0.32, 0.19, 0.19, 0.13, 0.09, 0.06, 0.04]
1111111111111011111111111111100011011011100111100010110110111011111111110101010001010101011100001011
1110111011111111110110100011110111111111110111111111011110111011110100111110100001011001011011001101
1011111111111111111111111110111111111100111101010010110111001011101101100010001000101000001010000011
1101111111111111001001111111111011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001000011101011111110011011011000001
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100001
0001010111111111011111111110110111101011111111111000111100101111010101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
ETA: 0:03:25 | Epoch 74 finished. Took 41.12 seconds.
loss: 0.002758, lagrangian_loss: -0.012068, attention_score_distillation_loss: 0.000988
loss: 0.002605, lagrangian_loss: -0.029704, attention_score_distillation_loss: 0.000984
----------------------------------------------------------------------
time: 2023-07-19 15:41:18
Evaluating: accuracy: 0.6173, eval_loss: 2.4781, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.611, expected_sparsity: 0.6028, expected_sequence_sparsity: 0.8987, target_sparsity: 0.59, step: 5900
lambda_1: 4.2653, lambda_2: 45.7971 lambda_3: 0.0000
train remain: [0.73 0.77 0.69 0.6 0.97 0.71 0.67 0.74 0.66]
infer remain: [0.7, 0.73, 0.65, 0.59, 1.0, 0.7, 0.66, 0.73, 0.66]
layerwise remain: [1.0, 1.0, 1.0, 0.7, 0.51, 0.33, 0.2, 0.2, 0.14, 0.09, 0.07, 0.04]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010101011000001011
1110111011111111110110100011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111100111101010010111111001011101101100010001000101000001010000011
1101111111111111001001111111111011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010101011111111001000011101011111110011011011000011
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100001
0101010111111111011111111110110111101011111111111000111100101111010101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.002687, lagrangian_loss: -0.030338, attention_score_distillation_loss: 0.000987
loss: 0.002762, lagrangian_loss: -0.041094, attention_score_distillation_loss: 0.000986
ETA: 0:02:44 | Epoch 75 finished. Took 39.1 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:41:43
Evaluating: accuracy: 0.6318, eval_loss: 2.5083, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6076, expected_sparsity: 0.5995, expected_sequence_sparsity: 0.8978, target_sparsity: 0.59, step: 5950
lambda_1: 4.0637, lambda_2: 45.8584 lambda_3: 0.0000
train remain: [0.73 0.78 0.7 0.61 0.97 0.72 0.67 0.75 0.68]
infer remain: [0.71, 0.73, 0.65, 0.59, 1.0, 0.71, 0.66, 0.74, 0.67]
layerwise remain: [1.0, 1.0, 1.0, 0.71, 0.52, 0.34, 0.2, 0.2, 0.14, 0.09, 0.07, 0.05]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010101011100001011
1110111011111111110110100011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111100111101010010111111001011101101100010001000101000001010000011
1101111111111111001001111111111011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001000011101011111110011011011000011
1011111111111011111111111111011101001010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011111001100110111101101111101110100001
0101010111111111011111111110110111101011111111111010111100101111010101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.003304, lagrangian_loss: -0.052091, attention_score_distillation_loss: 0.000984
loss: 0.003466, lagrangian_loss: -0.051544, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:42:09
Evaluating: accuracy: 0.6318, eval_loss: 2.4144, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.6063, expected_sparsity: 0.598, expected_sequence_sparsity: 0.8974, target_sparsity: 0.59, step: 6000
lambda_1: 3.7838, lambda_2: 45.9723 lambda_3: 0.0000
train remain: [0.74 0.78 0.7 0.61 0.97 0.72 0.67 0.75 0.69]
infer remain: [0.71, 0.73, 0.65, 0.6, 1.0, 0.71, 0.67, 0.74, 0.68]
layerwise remain: [1.0, 1.0, 1.0, 0.71, 0.52, 0.34, 0.2, 0.2, 0.14, 0.1, 0.07, 0.05]
1111111111111011111111111111100011011011100111100010110110111011111111110101010011010111011000001011
1110111011111111110110100011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111100111101010010111111001011101101100010001000101000001010000011
1111111111111111001001111111111011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010110011111111001100011101011111110011011011000011
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100011
0111010111111111011111111110110111101011111111111010111100101111010101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.001548, lagrangian_loss: -0.052469, attention_score_distillation_loss: 0.000984
ETA: 0:02:03 | Epoch 76 finished. Took 40.54 seconds.
loss: 0.002003, lagrangian_loss: -0.051637, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:42:35
Evaluating: accuracy: 0.6101, eval_loss: 2.5264, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5998, expected_sparsity: 0.5947, expected_sequence_sparsity: 0.8966, target_sparsity: 0.59, step: 6050
lambda_1: 3.4407, lambda_2: 46.1415 lambda_3: 0.0000
train remain: [0.74 0.78 0.7 0.61 0.97 0.72 0.67 0.75 0.69]
infer remain: [0.71, 0.74, 0.66, 0.6, 1.0, 0.71, 0.67, 0.74, 0.69]
layerwise remain: [1.0, 1.0, 1.0, 0.71, 0.53, 0.35, 0.21, 0.21, 0.15, 0.1, 0.07, 0.05]
1111111111111011111111111111100011111011100111100010110110111011111111110101010011010101011000001011
1110111011111111110110101011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111110111101010010111111001011101101100010001000101000001010000011
1111111111111111001001111111111011011001111101111000101000110101011100111111010101100000000000000001
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010110011111111001100011101011111110011011011000011
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100011
0111010111111111011111111110110111101111111111111010111100101111010101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.002690, lagrangian_loss: -0.049374, attention_score_distillation_loss: 0.000984
loss: 0.002767, lagrangian_loss: -0.045989, attention_score_distillation_loss: 0.000984
ETA: 0:01:22 | Epoch 77 finished. Took 39.42 seconds.
----------------------------------------------------------------------
time: 2023-07-19 15:43:01
Evaluating: accuracy: 0.6173, eval_loss: 2.453, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5998, expected_sparsity: 0.5946, expected_sequence_sparsity: 0.8965, target_sparsity: 0.59, step: 6100
lambda_1: 3.0483, lambda_2: 46.3619 lambda_3: 0.0000
train remain: [0.75 0.79 0.7 0.61 0.97 0.72 0.68 0.75 0.7 ]
infer remain: [0.71, 0.74, 0.66, 0.6, 1.0, 0.71, 0.67, 0.74, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 0.71, 0.53, 0.35, 0.21, 0.21, 0.15, 0.1, 0.07, 0.05]
1111111111111011111111111111100011111011100111100010110110111011111111110101010011010101011000001011
1110111011111111110110101011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111110111101010010111111001011101101100010001000101000001010000011
1111111111111111001001111111101011011001111101111000101000110101011100111111010101100000000000000011
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010110011111111001100011101011111110011011011000011
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100011
0111010111111111011111111110110111101111111111111010111100101111110101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.002130, lagrangian_loss: -0.043038, attention_score_distillation_loss: 0.000987
loss: 0.004331, lagrangian_loss: -0.039912, attention_score_distillation_loss: 0.000986
----------------------------------------------------------------------
time: 2023-07-19 15:43:27
Evaluating: accuracy: 0.6245, eval_loss: 2.5277, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5985, expected_sparsity: 0.5918, expected_sequence_sparsity: 0.8958, target_sparsity: 0.59, step: 6150
lambda_1: 2.5994, lambda_2: 46.6516 lambda_3: 0.0000
train remain: [0.75 0.79 0.7 0.61 0.97 0.72 0.68 0.75 0.7 ]
infer remain: [0.72, 0.74, 0.66, 0.6, 1.0, 0.71, 0.67, 0.74, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.53, 0.35, 0.21, 0.21, 0.15, 0.1, 0.07, 0.05]
1111111111111011111111111111100011111011100111100010110110111011111111110101010011010111011000001011
1110111011111111110110101011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111110111101010010111111001011101101100010001000101000001010000011
1111111111111111001001111111101011011001111101111000101000110101011100111111010101100000000000000011
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010110011111111001100011101011111110011011011000011
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100011
0111010111111111011111111110110111101111111111111010111100101111110101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.002988, lagrangian_loss: -0.034397, attention_score_distillation_loss: 0.000986
ETA: 0:00:41 | Epoch 78 finished. Took 40.82 seconds.
loss: 0.003530, lagrangian_loss: -0.029778, attention_score_distillation_loss: 0.000985
----------------------------------------------------------------------
time: 2023-07-19 15:43:52
Evaluating: accuracy: 0.6282, eval_loss: 2.495, token_prune_loc: [True, True, True, True, False, True, True, True, True], macs_sparsity: 0.5972, expected_sparsity: 0.5913, expected_sequence_sparsity: 0.8957, target_sparsity: 0.59, step: 6200
lambda_1: 2.1262, lambda_2: 46.9763 lambda_3: 0.0000
train remain: [0.75 0.79 0.7 0.61 0.97 0.72 0.68 0.75 0.7 ]
infer remain: [0.72, 0.74, 0.66, 0.6, 1.0, 0.72, 0.67, 0.74, 0.7]
layerwise remain: [1.0, 1.0, 1.0, 0.72, 0.53, 0.35, 0.21, 0.21, 0.15, 0.1, 0.08, 0.05]
1111111111111011111111111111100011111011100111100010110110111011111111110101010011010111011000001011
1110111011111111110110101011110111111111110111111111011110111011110100111110100001011001011011001111
1011111111111111111111111110111111111110111101010010111111001011101101100010001000101000001010000011
1111111111111111001001111111101011011001111101111000101000110101011100111111010101100000000000000011
1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
1010111111101101111011110111111111111111001001111010111011111111001100011101011111110011011011000011
1011111111111011111111111111011101101010111111111110110001000011111010100101011001100000000111010111
1111011101011111011111111101111011101011111011111111100011111011101001100110111101101111101110100011
0111010111111111011111111110110111101111111111111010111100101111110101111011010011010100010001000101
Best eval score so far: 0.6823 @ step 5250 epoch 67.31
loss: 0.173288, lagrangian_loss: -0.023798, attention_score_distillation_loss: 0.000984
loss: 0.003138, lagrangian_loss: -0.018495, attention_score_distillation_loss: 0.000986
ETA: 0:00:00 | Epoch 79 finished. Took 39.2 seconds.
07/19/2023 15:46:17 - WARNING - urllib3.connectionpool - Retrying (Retry(total=4, connect=5, read=4, redirect=5, status=5)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='southcentralus.api.azureml.ms', port=443): Read timed out. (read timeout=120)")': /mlflow/v2.0/subscriptions/d4404794-ab5b-48de-b7c7-ec1fefb0a04e/resourceGroups/gcr-singularity-octo/providers/Microsoft.MachineLearningServices/workspaces/msroctows/api/2.0/mlflow/runs/get?run_uuid=c90f704e-2048-427a-a825-62713710c8b9&run_id=c90f704e-2048-427a-a825-62713710c8b9