/var/spool/slurmd/job336337/slurm_script: line 12: activate: No such file or directory
W0217 20:17:08.481000 3089289 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0217 20:17:08.481000 3089289 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0217 20:17:08.481000 3089289 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0217 20:17:08.481000 3089289 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
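The torchrun warning above (repeated once per launcher process) means every worker runs with a single OpenMP thread unless the variable is tuned. A minimal sketch of one way to override it, assuming an 8-thread budget per rank and that the override runs before torch is imported; exporting `OMP_NUM_THREADS` in the sbatch script before the `torchrun` call achieves the same effect.

```python
# Hedged sketch (assumed values, not taken from this job): pin OMP_NUM_THREADS
# at the very top of the training entry point, before torch initializes its
# CPU thread pools, so torchrun's default of 1 thread per worker is replaced.
import os

os.environ["OMP_NUM_THREADS"] = "8"  # assumed: roughly CPU cores per node / GPUs per node

import torch

# Mirror the same budget for PyTorch's intra-op CPU parallelism.
torch.set_num_threads(int(os.environ["OMP_NUM_THREADS"]))
```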
PyTorch: setting up devices
loading configuration file config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/config.json
You are using a model of type qwen2_5_vl to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield errors.
Model config LlavaQwenConfig {
  "architectures": [
    "Qwen2_5_VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 128000,
  "max_window_layers": 28,
  "model_type": "llava_qwen",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.49.0.dev0",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "hidden_size": 1280,
    "in_chans": 3,
    "model_type": "qwen2_5_vl",
    "spatial_patch_size": 14,
    "tokens_per_second": 2
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}
loading weights file model.safetensors from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/model.safetensors.index.json
Instantiating LlavaQwenForCausalLM model under default dtype torch.bfloat16.
You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}
Instantiating Qwen2_5_VisionTransformerPretrainedModel model under default dtype torch.bfloat16.
Loading checkpoint shards: 0%| | 0/5 [00:00
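The Flash Attention 2.0 message above is emitted because the model is instantiated on CPU before being moved to the GPU. As a hedged sketch of the plain Transformers loading path (the actual run uses a custom `LlavaQwenForCausalLM` wrapper, which is not reproduced here), the warning goes away once the model lands on the GPU:

```python
# Hedged sketch: load the Qwen2.5-VL checkpoint in bfloat16 with FlashAttention-2,
# then move it to the GPU so the FA2 kernels can actually be used.
# The class name follows the checkpoint's `architectures` field and the
# transformers 4.49 API; model name and device handling are assumptions.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,               # matches the config's torch_dtype
    attn_implementation="flash_attention_2",  # the source of the CPU-init warning
)
model.to("cuda")  # FA2 only runs on GPU, so this silences the warning
```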
AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. 
`use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. 
`use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. 
loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, 
single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. 
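The `use_fast` notice above can be silenced by choosing an image-processor variant explicitly. A minimal sketch, assuming a fast Qwen2-VL image processor is available in this Transformers build:

```python
# Hedged sketch: request the fast image processor explicitly, or pass
# use_fast=False to keep the current slow behaviour and its exact outputs.
from transformers import AutoImageProcessor

image_processor = AutoImageProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    use_fast=True,  # an explicit choice removes the warning either way
)
```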
Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", 
rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. 
loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
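Both the `use_fast` warning and the pixel budget recorded in preprocessor_config.json (`min_pixels`/`max_pixels` in the processor dump below) are controlled at load time. A minimal sketch, assuming the kwargs are forwarded to the image processor as the warning suggests; the two pixel values are illustrative assumptions, not the values from this log:

from transformers import AutoProcessor

# Opt in to the fast image processor ahead of the v4.48 default switch and
# optionally narrow the visual token budget. min_pixels/max_pixels override the
# defaults stored in preprocessor_config.json (3136 and 12845056 in this run).
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    use_fast=True,
    min_pixels=256 * 28 * 28,   # assumed value for illustration
    max_pixels=1280 * 28 * 28,  # assumed value for illustration
)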
Processor Qwen2_5_VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 12845056,
    "shortest_edge": 3136
  },
  "temporal_patch_size": 2
}

- tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
  151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
}
)

{
  "processor_class": "Qwen2_5_VLProcessor"
}
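The processor dump above lists the chat and vision special tokens (`<|im_start|>`, `<|vision_start|>`, `<|image_pad|>`, and so on) that the chat template wraps around image content. A minimal usage sketch of this processor and checkpoint for inference, assuming the standard Transformers chat-template flow; the image path and prompt are placeholders:

from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
# apply_chat_template wraps the turn in <|im_start|>/<|im_end|> and inserts the
# vision placeholder tokens listed in the tokenizer dump above.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open("example.jpg")  # placeholder path
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
# The decoded string still contains the prompt tokens; trimming is omitted for brevity.
print(processor.batch_decode(out, skip_special_tokens=True)[0])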
loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. 
Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: 
AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: 
AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json loading file added_tokens.json from cache at None loading file special_tokens_map.json from cache at None loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json loading file chat_template.jinja from cache at None Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json
loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt
loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json
loading file chat_template.jinja from cache at None
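The cache paths immediately above show the Hugging Face hub cache living on the shared filesystem under /fsx_0/user/zhaojiang/models/hub. A minimal sketch of one way to obtain that layout (using HF_HOME is an assumption, not something taken from the job script, since the hub cache defaults to $HF_HOME/hub):

    import os

    # Point the hub cache at the shared filesystem before importing transformers,
    # so every from_pretrained call resolves files from .../models/hub instead of
    # ~/.cache/huggingface.
    os.environ["HF_HOME"] = "/fsx_0/user/zhaojiang/models"

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")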
Loading checkpoint shards:  60%|██████    | 3/5 [00:00<00:00,  3.96it/s]
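The embedding-resize and special-token warnings earlier in the log usually go together: the training script adds tokens to the tokenizer and then resizes the model's embedding matrix to match, and the sharded model load is what produces the "Loading checkpoint shards" progress bar above. A minimal sketch of that pattern, with `pad_to_multiple_of` supplied as the warning suggests (the added token below is a placeholder, not the tokens actually added in this run, and the class name assumes a transformers release that ships the Qwen2.5-VL model classes):

    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

    # Placeholder token: the real run adds its own extra tokens (the warning
    # reports a resized vocabulary of 151668 entries).
    processor.tokenizer.add_tokens(["<my_new_token>"], special_tokens=True)

    # Rounding the embedding rows up to a multiple of 64 keeps the matmul shapes
    # Tensor-Core friendly, which is exactly what the warning is about.
    model.resize_token_embeddings(len(processor.tokenizer), pad_to_multiple_of=64)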
Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: 
AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, 
lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: 
AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, 
lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: 
AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
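The notices above come from the usual pattern of registering extra special tokens on the Qwen2.5-VL tokenizer and then resizing the model's input embeddings to match. The training script itself is not part of this log, so the following is only a minimal sketch of that pattern, assuming a recent transformers release that ships the Qwen2.5-VL classes; the three placeholder token strings are hypothetical (the log only shows that the vocabulary grew from 151665 to 151668 entries).

# Sketch only -- reconstructed from the log messages, not the actual training script.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"

processor = AutoProcessor.from_pretrained(model_id)   # -> Qwen2_5_VLProcessor (dumped above)
tokenizer = processor.tokenizer                        # -> Qwen2TokenizerFast, ids 0..151664 in use

# Registering extra special tokens is what triggers the
# "Special tokens have been added in the vocabulary ..." notice.
# The token strings below are placeholders; the log only shows that three ids were added.
num_added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|extra_0|>", "<|extra_1|>", "<|extra_2|>"]}
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id)

# Resizing without pad_to_multiple_of yields the 151668-row embedding the
# warning complains about (151665 existing ids + 3 new tokens).
model.resize_token_embeddings(len(tokenizer))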
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: 
AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), Processor Qwen2_5_VLProcessor: - image_processor: 
Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: 
AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. 
Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, 
"image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 
151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, 
lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
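This notice is emitted when extra special tokens are registered on the tokenizer; the corresponding embedding rows are freshly initialised and only become meaningful if they are trained. A rough sketch of that pattern, with hypothetical token strings (the tokens actually added by this run are not named in the log):

# Sketch only: add task-specific special tokens, then resize the embeddings.
# The token strings below are hypothetical placeholders.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

num_added = processor.tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|obs_start|>", "<|obs_end|>"]}
)
if num_added > 0:
    # The new rows are randomly initialised, so they must stay trainable
    # during fine-tuning; otherwise the added tokens carry no signal.
    model.resize_token_embeddings(len(processor.tokenizer), pad_to_multiple_of=64)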
Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, 
clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: 
AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", 
rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, 
lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, 
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available.
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
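The warning above can be avoided by passing `pad_to_multiple_of` when the embedding matrix is resized. A minimal sketch, assuming the stock transformers API rather than this job's LlavaQwenForCausalLM wrapper, and using hypothetical placeholder tokens for whatever three tokens the training script adds (151665 -> 151668 according to the warning):

# Sketch only: resize the embeddings to a Tensor Core friendly multiple.
from transformers import AutoTokenizer, Qwen2_5_VLForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Hypothetical extra tokens; the log only tells us that three tokens were added.
tokenizer.add_special_tokens({"additional_special_tokens": ["<extra_0>", "<extra_1>", "<extra_2>"]})

# Rounding the vocabulary dimension up to a multiple of 64 follows the NVIDIA guide
# linked in the warning; 151668 would round up to 151680.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
print(model.get_input_embeddings().weight.shape[0])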
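The tokenizer dumps above also list the conversation and vision markers (<|im_start|>, <|im_end|>, <|vision_start|>, <|image_pad|>, <|vision_end|>). A hedged sketch of how these are normally consumed through the processor's chat template, with illustrative message content rather than anything taken from this run:

# Sketch only: render a chat prompt so the special tokens from the dump appear in the text.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///tmp/example.jpg"},  # hypothetical image path
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# The rendered prompt wraps the turn in <|im_start|>/<|im_end|> and marks the image
# with <|vision_start|><|image_pad|><|vision_end|>, i.e. the ids listed in the dump.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)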
Loading checkpoint shards: 80%|████████ | 4/5 [00:01<00:00, 4.04it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 4.33it/s]
Loading checkpoint shards: 100%|██████████| 5/5 [00:01<00:00, 4.11it/s]
All model checkpoint weights were used when initializing LlavaQwenForCausalLM.
All the weights of LlavaQwenForCausalLM were initialized from the model checkpoint at Qwen/Qwen2.5-VL-7B-Instruct. If your task is similar to the task the model of the checkpoint was trained on, you can already use LlavaQwenForCausalLM for predictions without further training.
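The shard progress bars and the two messages above confirm that every checkpoint weight was consumed when LlavaQwenForCausalLM (the training repo's custom class) was initialized from Qwen/Qwen2.5-VL-7B-Instruct. A minimal sketch of the same check, with the stock transformers class standing in for the custom one:

# Sketch only: load the checkpoint and inspect which weights were (not) used.
from transformers import Qwen2_5_VLForConditionalGeneration

model, loading_info = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    output_loading_info=True,
)

# Empty lists correspond to the "all weights were used / initialized" messages in the log.
print(loading_info["missing_keys"])
print(loading_info["unexpected_keys"])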
Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, 
loading configuration file generation_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/generation_config.json
Generate config GenerationConfig {
  "attn_implementation": "flash_attention_2",
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 0.1,
  "top_k": 1,
  "top_p": 0.001
}
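The generation defaults printed above can likewise be loaded and adjusted programmatically; a minimal sketch assuming Hugging Face Transformers (the override below is illustrative, not something the log shows the job doing):

# Hypothetical sketch: inspect the generation defaults shown above and switch to
# greedy decoding, e.g. for reproducible evaluation runs.
from transformers import GenerationConfig

gen_config = GenerationConfig.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
print(gen_config.temperature, gen_config.top_p, gen_config.repetition_penalty)

gen_config.do_sample = False   # greedy decoding
gen_config.temperature = None  # unset sampling-only fields to avoid warnings
gen_config.top_p = None
gen_config.top_k = None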
rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: 
AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, 
lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, 
single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 
151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", 
rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
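The Qwen2VLImageProcessor settings repeated throughout this dump (patch_size 14, merge_size 2, min_pixels 3136, max_pixels 12845056) bound how many visual tokens an image can occupy, assuming the Qwen2-VL patch-and-merge scheme in which each 14x14 pixel block becomes one ViT patch and each merge_size x merge_size group of patches collapses into a single token for the language model. A quick sanity check of those bounds (illustrative only, not part of the training script):

    # Rough visual-token budget implied by the Qwen2VLImageProcessor config.
    patch_size = 14          # each ViT patch covers 14x14 pixels
    merge_size = 2           # 2x2 neighbouring patches merge into one LLM token
    min_pixels = 3136        # 56 * 56
    max_pixels = 12_845_056  # 3584 * 3584

    def visual_tokens(num_pixels: int) -> int:
        patches = num_pixels // (patch_size * patch_size)
        return patches // (merge_size * merge_size)

    print(visual_tokens(min_pixels))  # 4      -> floor enforced by min_pixels
    print(visual_tokens(max_pixels))  # 16384  -> ceiling enforced by max_pixels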
loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Processor Qwen2_5_VLProcessor:
- image_processor: Qwen2VLImageProcessor {
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [ 0.26862954, 0.26130258, 0.27577711 ],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": { "longest_edge": 12845056, "shortest_edge": 3136 },
  "temporal_patch_size": 2
}
- tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={
  151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
  151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
  151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
} )
{
  "processor_class": "Qwen2_5_VLProcessor"
}
Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 }
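Note: the image-processor fields above also fix how many visual tokens an image becomes. With patch_size=14 and merge_size=2, one output token covers a 28x28-pixel area, so min_pixels=3136 corresponds to 4 tokens and max_pixels=12845056 to 16384 tokens per image. A small check of that arithmetic:

    # Token-count arithmetic implied by the Qwen2VLImageProcessor config above.
    patch_size = 14
    merge_size = 2
    pixels_per_token = (patch_size * merge_size) ** 2  # 28 x 28 = 784 pixels per visual token

    min_pixels = 3136
    max_pixels = 12845056
    print(min_pixels // pixels_per_token)  # 4 visual tokens for the smallest image
    print(max_pixels // pixels_per_token)  # 16384 visual tokens for the largest image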
loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json
loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt
loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json
loading file chat_template.jinja from cache at None
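Note: these files load a base vocabulary of 151643 BPE entries plus the 22 added tokens listed in the dump above (IDs 151643-151664), i.e. len(tokenizer) = 151665; the resize to 151668 reported in the warnings implies three more tokens registered by the training script itself, which are not named in this log. A small sketch of how those counts can be inspected, assuming the standard tokenizer API:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
    print(tok.vocab_size)                  # 151643 base BPE entries
    print(len(tok.added_tokens_decoder))   # 22 added tokens (IDs 151643-151664)
    print(len(tok))                        # 151665 = base vocab + added tokens
    # The warnings above report a new embedding size of 151668, i.e.
    # 151668 - 151665 = 3 extra tokens added by the training script
    # (their names are not shown anywhere in this log).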
AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: 
AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, 
single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: 
AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc

Processor Qwen2_5_VLProcessor:
- image_processor: Qwen2VLImageProcessor {
    "do_convert_rgb": true,
    "do_normalize": true,
    "do_rescale": true,
    "do_resize": true,
    "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ],
    "image_processor_type": "Qwen2VLImageProcessor",
    "image_std": [ 0.26862954, 0.26130258, 0.27577711 ],
    "max_pixels": 12845056,
    "merge_size": 2,
    "min_pixels": 3136,
    "patch_size": 14,
    "processor_class": "Qwen2_5_VLProcessor",
    "resample": 3,
    "rescale_factor": 0.00392156862745098,
    "size": { "longest_edge": 12845056, "shortest_edge": 3136 },
    "temporal_patch_size": 2
  }
- tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right',
    special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']},
    clean_up_tokenization_spaces=False,
    added_tokens_decoder={
      151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
      151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
      151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
      151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
      151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
      151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
      151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
      151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
      151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False),
    }
  )
  { "processor_class": "Qwen2_5_VLProcessor" }
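Note: a minimal sketch of how a processor with the configuration dumped above is typically loaded (assuming the Hugging Face transformers AutoProcessor API; the pixel bounds simply mirror the values printed in the dump, they are not extra settings from this job):

    from transformers import AutoProcessor

    # Load the Qwen2.5-VL processor; min_pixels / max_pixels mirror the dump above.
    processor = AutoProcessor.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct",
        min_pixels=3136,
        max_pixels=12845056,
    )
    print(processor)  # reproduces the Qwen2_5_VLProcessor summary printed in this log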
single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: 
AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", 
rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": 
true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, 
normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, 
special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: 
AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. 
For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } Processor Qwen2_5_VLProcessor: - image_processor: Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 } - tokenizer: Qwen2TokenizerFast(name_or_path='Qwen/Qwen2.5-VL-7B-Instruct', vocab_size=151643, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'eos_token': '<|im_end|>', 'pad_token': '<|endoftext|>', 'additional_special_tokens': ['<|im_start|>', '<|im_end|>', '<|object_ref_start|>', '<|object_ref_end|>', '<|box_start|>', '<|box_end|>', '<|quad_start|>', '<|quad_end|>', '<|vision_start|>', '<|vision_end|>', '<|vision_pad|>', '<|image_pad|>', '<|video_pad|>']}, clean_up_tokenization_spaces=False, added_tokens_decoder={ 151643: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151644: AddedToken("<|im_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151645: AddedToken("<|im_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151646: AddedToken("<|object_ref_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151647: AddedToken("<|object_ref_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151648: AddedToken("<|box_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151649: AddedToken("<|box_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151650: AddedToken("<|quad_start|>", rstrip=False, 
lstrip=False, single_word=False, normalized=False, special=True), 151651: AddedToken("<|quad_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151652: AddedToken("<|vision_start|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151653: AddedToken("<|vision_end|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151654: AddedToken("<|vision_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151655: AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } )
{ "processor_class": "Qwen2_5_VLProcessor" }
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
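The resize warning above points at the `pad_to_multiple_of` argument. A minimal sketch of how that hint could be applied with a recent transformers release; the model id is the one loaded in this run, but the multiple of 64 is an illustrative choice, not a value taken from the log:

    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)

    # len(processor.tokenizer) counts the added special tokens; pad_to_multiple_of
    # rounds the new embedding dimension up so the matmuls stay Tensor Core eligible.
    model.resize_token_embeddings(len(processor.tokenizer), pad_to_multiple_of=64)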
AddedToken("<|image_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151656: AddedToken("<|video_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), 151657: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151658: AddedToken("", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151659: AddedToken("<|fim_prefix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151660: AddedToken("<|fim_middle|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151661: AddedToken("<|fim_suffix|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151662: AddedToken("<|fim_pad|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151663: AddedToken("<|repo_name|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), 151664: AddedToken("<|file_sep|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=False), } ) { "processor_class": "Qwen2_5_VLProcessor" } loading configuration file generation_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/generation_config.json Generate config GenerationConfig { "attn_implementation": "flash_attention_2", "bos_token_id": 151643, "do_sample": true, "eos_token_id": [ 151645, 151643 ], "pad_token_id": 151643, "repetition_penalty": 1.05, "temperature": 0.1, "top_k": 1, "top_p": 0.001 } You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 151668. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json loading configuration file preprocessor_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/preprocessor_config.json Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`. 
Image processor Qwen2VLImageProcessor { "do_convert_rgb": true, "do_normalize": true, "do_rescale": true, "do_resize": true, "image_mean": [ 0.48145466, 0.4578275, 0.40821073 ], "image_processor_type": "Qwen2VLImageProcessor", "image_std": [ 0.26862954, 0.26130258, 0.27577711 ], "max_pixels": 12845056, "merge_size": 2, "min_pixels": 3136, "patch_size": 14, "processor_class": "Qwen2_5_VLProcessor", "resample": 3, "rescale_factor": 0.00392156862745098, "size": { "longest_edge": 12845056, "shortest_edge": 3136 }, "temporal_patch_size": 2 }
loading file vocab.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/vocab.json
loading file merges.txt from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/merges.txt
loading file tokenizer.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /fsx_0/user/zhaojiang/models/hub/models--Qwen--Qwen2.5-VL-7B-Instruct/snapshots/6e6556e8ce728c7b3e438d75ebf04ec93403dc19/tokenizer_config.json
loading file chat_template.jinja from cache at None
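The Qwen2VLImageProcessor config above exposes a pixel budget (`min_pixels` 3136, `max_pixels` 12845056) that controls how many visual tokens each image produces. A sketch of overriding it at load time, with illustrative values rather than ones taken from this run:

    from transformers import AutoProcessor

    # Each visual token covers a 28x28 pixel area (patch_size 14 with a 2x2 merge),
    # so the budget is usually expressed in multiples of 28*28.
    processor = AutoProcessor.from_pretrained(
        "Qwen/Qwen2.5-VL-7B-Instruct",
        min_pixels=256 * 28 * 28,
        max_pixels=1280 * 28 * 28,
    )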
/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
checkpoint = torch.load(checkpoint_path, map_location=map_location)
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. 
checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. 
It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. 
This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature. checkpoint = torch.load(checkpoint_path, map_location=map_location) /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/model/multimodal_encoder/eva_clip/eva_vit.py:622: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. 
We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint = torch.load(checkpoint_path, map_location=map_location)
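The FutureWarning above is raised once per worker process by the `torch.load` call at llava/model/multimodal_encoder/eva_clip/eva_vit.py:622. A minimal sketch of the opt-in the warning recommends is shown below; the checkpoint path and map_location values are placeholders, and it assumes (not verified here) that the EVA-CLIP checkpoint contains only tensors and other types that `weights_only=True` already permits.

    import torch

    # Placeholder values for illustration; eva_vit.py resolves these itself.
    checkpoint_path = "/path/to/eva_clip_checkpoint.pt"
    map_location = "cpu"

    # weights_only=True restricts unpickling to tensors and allowlisted types,
    # which is what the FutureWarning recommends opting into early.
    # If the checkpoint pickles additional classes, they must be allowlisted
    # first, e.g. torch.serialization.add_safe_globals([SomeCheckpointClass]).
    checkpoint = torch.load(checkpoint_path, map_location=map_location, weights_only=True)

If loading then fails with an UnpicklingError naming a blocked global, that class has to be registered via `torch.serialization.add_safe_globals` before retrying.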
Using custom data configuration default-5e4e9de28fd39dca
Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset
Overwrite dataset info from restored data version if exists.
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f
Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f)
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Using custom data configuration default-5e4e9de28fd39dca Loading Dataset Infos from /home/zhaojiang/.local/lib/python3.10/site-packages/datasets/packaged_modules/webdataset Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Found cached dataset webdataset (/fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f) Loading Dataset info from /fsx_0/user/zhaojiang/wb/webdataset/default-5e4e9de28fd39dca/0.0.0/e9ef0843eead451e800ef3bd9a9ee86b731520f88aa20be2d598ddfeef5b3f7f Overwrite dataset info from restored data version if exists. 
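The block above is Hugging Face `datasets` finding a previously prepared `webdataset` build in the cache and reusing it instead of re-processing the shards. A minimal sketch of a load that would hit such a cache; the shard glob is illustrative, and only the cache directory is taken from the log:

```python
from datasets import load_dataset

# When the same builder/config/files have been prepared before, load_dataset
# reuses the Arrow cache under cache_dir instead of rebuilding the dataset.
ds = load_dataset(
    "webdataset",
    data_files={"train": "shards/*.tar"},   # illustrative shard pattern, not from the log
    cache_dir="/fsx_0/user/zhaojiang/wb",
    split="train",
)
```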
/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py:1616: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `LLaVATrainer.__init__`. Use `processing_class` instead.
  trainer = LLaVATrainer(
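The FutureWarning comes from constructing the trainer with `tokenizer=...`. A minimal sketch of the replacement keyword on a plain `transformers.Trainer`; the tiny public checkpoint is only there to make the snippet self-contained, and the same keyword applies to a subclass like the `LLaVATrainer` in the traceback:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")     # tiny demo model
tokenizer = AutoTokenizer.from_pretrained("sshleifer/tiny-gpt2")

args = TrainingArguments(output_dir="/tmp/out", report_to=[])

# Deprecated: Trainer(model=model, args=args, tokenizer=tokenizer)
# Current style, as the warning suggests:
trainer = Trainer(model=model, args=args, processing_class=tokenizer)
```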
Using auto half precision backend
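"Using auto half precision backend" is the Trainer selecting its native mixed-precision path from the precision flags it was given. A minimal sketch of arguments that produce this setup; the flag values are assumptions rather than values read from this job's launch script, apart from the per-device batch size, which matches the training summary below:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/tmp/out",
    bf16=True,                        # or fp16=True on hardware without bfloat16 support
    per_device_train_batch_size=8,    # matches "Instantaneous batch size per device = 8"
    gradient_accumulation_steps=1,
)
```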
Attempting to resume from /fsx_0/user/zhaojiang/models/qwen-vl-gen/checkpoint-8000
***** Running training *****
  Num examples = 194,420,624
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 1,024
  Gradient Accumulation steps = 1
  Total optimization steps = 569,592
  Number of trainable parameters = 1,365,239,712
Continuing training from checkpoint, will skip to saved global_step
Continuing training from epoch 0
Continuing training from global step 8000
Will skip the first 0 epochs then the first 8000 batches in the first epoch.
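The numbers in the training summary are mutually consistent: with 194,420,624 examples, a global batch size of 1,024, and 3 epochs, ceil(194,420,624 / 1,024) × 3 = 189,864 × 3 = 569,592 optimization steps, exactly the figure reported. A small check, plus the standard Hugging Face resume call implied by the checkpoint message (the path is taken from the log):

```python
import math

num_examples = 194_420_624
global_batch_size = 1_024    # 8 per device; 1,024 global implies 128 data-parallel ranks
num_epochs = 3               # (consistent with the 16-node run name), grad accum 1

steps_per_epoch = math.ceil(num_examples / global_batch_size)   # 189,864
total_steps = steps_per_epoch * num_epochs                      # 569,592
print(total_steps)

# Resuming as in the log (standard Trainer API):
# trainer.train(resume_from_checkpoint="/fsx_0/user/zhaojiang/models/qwen-vl-gen/checkpoint-8000")
```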
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Currently logged in as: jchen169 to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.
wandb: Tracking run with wandb version 0.19.6
wandb: Run data is saved locally in /opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/wandb/run-20250217_203915-bihwjece
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run qwen-vl-diff-clip-16-nodes_early_pool2d_4
wandb: ⭐️ View project at https://wandb.ai/jchen169/huggingface
wandb: 🚀 View run at https://wandb.ai/jchen169/huggingface/runs/bihwjece
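As the first W&B line states, the integration can be switched off with an environment variable set before the trainer is created; a minimal sketch (the `report_to` alternative is the standard TrainingArguments option, not something this job sets):

```python
import os

# Disable the automatic Weights & Biases integration, as the log message suggests:
os.environ["WANDB_DISABLED"] = "true"

# Alternative: opt out of W&B via TrainingArguments(report_to=[]),
# or run `wandb offline` to log locally and sync later.
```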
/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py:3119: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  checkpoint_rng_state = torch.load(rng_file)
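The same FutureWarning is emitted by every rank when the checkpoint RNG state is restored. A minimal sketch of the two mitigations the warning itself suggests (the path below is a placeholder, not a file from this run):

import os
import torch

rng_file = "checkpoint-8000/rng_state_0.pth"  # placeholder for the per-rank RNG file

if os.path.exists(rng_file):
    # Option 1: opt in to the future default; arbitrary pickled objects are refused.
    checkpoint_rng_state = torch.load(rng_file, weights_only=True)

# Option 2: if the file legitimately contains non-tensor objects you trust,
# allowlist them explicitly before loading, as the warning recommends:
# torch.serialization.add_safe_globals([TrustedClass])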
  0%| | 0/569592 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
[rank0]:     response.raise_for_status()
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1024, in raise_for_status
[rank0]:     raise HTTPError(http_error_msg, response=self)
[rank0]: requests.exceptions.HTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/complete_multipart?uploadId=7Kmb1Pc9aaAEQVAPW5WowD35RLZSwmHasMnyMVnAxQ1kC7POBPjOojl2diFOQb3TaSYujtO6667Xpn5rI4HALaTz8BVAfAMFmsbR7Bz5Oi82KA_gFGLsZtElN8LPu7VQ&bucket=hf-hub-lfs-us-east-1&prefix=repos%2Fe0%2F93%2Fe09307d4e6846d783ee679d0240cc38c52f308be58f71c108a7978856a27c6ad&expiration=Wed%2C+19+Feb+2025+14%3A58%3A44+GMT&signature=a8119b282ccf29579b2529f4bb9bb63120f50b7fad42ccd024a6035b4a738206
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 437, in _wrapped_lfs_upload
[rank0]:     lfs_upload(operation=operation, lfs_batch_action=batch_action, headers=headers, endpoint=endpoint)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/lfs.py", line 246, in lfs_upload
[rank0]:     _upload_multi_part(operation=operation, header=header, chunk_size=chunk_size, upload_url=upload_url)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/lfs.py", line 355, in _upload_multi_part
[rank0]:     hf_raise_for_status(completion_res)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 477, in hf_raise_for_status
[rank0]:     raise _format(HfHubHTTPError, str(e), response) from e
[rank0]: huggingface_hub.errors.HfHubHTTPError: 504 Server Error: Gateway Time-out for url: https://huggingface.co/api/complete_multipart?uploadId=7Kmb1Pc9aaAEQVAPW5WowD35RLZSwmHasMnyMVnAxQ1kC7POBPjOojl2diFOQb3TaSYujtO6667Xpn5rI4HALaTz8BVAfAMFmsbR7Bz5Oi82KA_gFGLsZtElN8LPu7VQ&bucket=hf-hub-lfs-us-east-1&prefix=repos%2Fe0%2F93%2Fe09307d4e6846d783ee679d0240cc38c52f308be58f71c108a7978856a27c6ad&expiration=Wed%2C+19+Feb+2025+14%3A58%3A44+GMT&signature=a8119b282ccf29579b2529f4bb9bb63120f50b7fad42ccd024a6035b4a738206
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank0]:     train(attn_implementation="flash_attention_2")
[rank0]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1631, in train
[rank0]:     trainer.train(resume_from_checkpoint=True)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2241, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank0]:     self._maybe_log_save_evaluate(
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3093, in _maybe_log_save_evaluate
[rank0]:     self.control = self.callback_handler.on_save(self.args, self.state, self.control)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer_callback.py", line 546, in on_save
[rank0]:     return self.call_event("on_save", args, state, control)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/trainer_callback.py", line 557, in call_event
[rank0]:     result = getattr(callback, event)(
[rank0]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1608, in on_save
[rank0]:     upload_folder(
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1524, in _inner
[rank0]:     return fn(self, *args, **kwargs)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 4677, in upload_folder
[rank0]:     commit_info = self.create_commit(
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 1524, in _inner
[rank0]:     return fn(self, *args, **kwargs)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 3961, in create_commit
[rank0]:     self.preupload_lfs_files(
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/hf_api.py", line 4215, in preupload_lfs_files
[rank0]:     _upload_lfs_files(
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 452, in _upload_lfs_files
[rank0]:     thread_map(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/tqdm/contrib/concurrent.py", line 69, in thread_map
[rank0]:     return _executor_map(ThreadPoolExecutor, fn, *iterables, **tqdm_kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/tqdm/contrib/concurrent.py", line 51, in _executor_map
[rank0]:     return list(tqdm_class(ex.map(fn, *iterables, chunksize=chunksize), **kwargs))
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/tqdm/std.py", line 1181, in __iter__
[rank0]:     for obj in iterable:
[rank0]:   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
[rank0]:     yield _result_or_cancel(fs.pop())
[rank0]:   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
[rank0]:     return fut.result(timeout)
[rank0]:   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
[rank0]:     return self.__get_result()
[rank0]:   File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
[rank0]:     raise self._exception
[rank0]:   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[rank0]:     result = self.fn(*self.args, **self.kwargs)
[rank0]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/huggingface_hub/_commit_api.py", line 439, in _wrapped_lfs_upload
[rank0]:     raise RuntimeError(f"Error while uploading '{operation.path_in_repo}' to the Hub.") from exc
[rank0]: RuntimeError: Error while uploading 'checkpoint-30000/rng_state_101.pth' to the Hub.
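The traceback shows the custom on_save callback calling huggingface_hub.upload_folder and dying on a single 504 from the Hub while completing a multipart LFS upload. A hedged sketch of one way such an upload could tolerate transient Hub errors instead of aborting rank 0 (repo id, folder, and retry policy below are illustrative placeholders, not taken from the training script):

import time
from huggingface_hub import upload_folder
from huggingface_hub.errors import HfHubHTTPError

def upload_checkpoint_with_retries(folder, repo_id, retries=3, backoff_s=60):
    """Retry the folder upload a few times so a transient gateway timeout
    does not take down the whole training run."""
    for attempt in range(1, retries + 1):
        try:
            return upload_folder(folder_path=folder, repo_id=repo_id)
        except (HfHubHTTPError, RuntimeError) as exc:
            if attempt == retries:
                raise
            print(f"Upload attempt {attempt} failed ({exc}); retrying in {backoff_s}s")
            time.sleep(backoff_s)

# e.g. upload_checkpoint_with_retries("checkpoint-30000", "user/repo")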
[rank0]:[W218 15:00:19.506684404 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
[rank13]:[E218 15:08:11.511786445 ProcessGroupNCCL.cpp:616] [Rank 13] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600041 milliseconds before timing out.
[rank13]:[E218 15:08:11.532294490 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 13] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank28]:[E218 15:08:11.148231267 ProcessGroupNCCL.cpp:616] [Rank 28] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
[rank45]:[E218 15:08:11.082260003 ProcessGroupNCCL.cpp:616] [Rank 45] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600035 milliseconds before timing out.
[rank28]:[E218 15:08:11.163320437 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 28] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank45]:[E218 15:08:11.094462969 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 45] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank50]:[E218 15:08:11.210762659 ProcessGroupNCCL.cpp:616] [Rank 50] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600010 milliseconds before timing out.
[rank12]:[E218 15:08:11.798960516 ProcessGroupNCCL.cpp:616] [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600015 milliseconds before timing out.
[rank89]:[E218 15:08:11.808220410 ProcessGroupNCCL.cpp:616] [Rank 89] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600038 milliseconds before timing out.
[rank79]:[E218 15:08:12.760853759 ProcessGroupNCCL.cpp:616] [Rank 79] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600031 milliseconds before timing out.
[rank40]:[E218 15:08:11.284080024 ProcessGroupNCCL.cpp:616] [Rank 40] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
[rank95]:[E218 15:08:12.159109315 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 95] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank70]:[E218 15:08:12.828148142 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 70] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank67]:[E218 15:08:12.828161492 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 67] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank64]:[E218 15:08:12.828155982 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 64] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank49]:[E218 15:08:12.568251380 ProcessGroupNCCL.cpp:616] [Rank 49] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600090 milliseconds before timing out. [rank11]:[E218 15:08:12.888312208 ProcessGroupNCCL.cpp:616] [Rank 11] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. [rank108]:[E218 15:08:12.687519378 ProcessGroupNCCL.cpp:616] [Rank 108] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600071 milliseconds before timing out. [rank122]:[E218 15:08:12.714412918 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 122] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank114]:[E218 15:08:12.046868573 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 114] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank65]:[E218 15:08:12.833859369 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 65] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank8]:[E218 15:08:12.886718682 ProcessGroupNCCL.cpp:616] [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. [rank35]:[E218 15:08:12.514891145 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 35] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank21]:[E218 15:08:12.712203639 ProcessGroupNCCL.cpp:616] [Rank 21] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. [rank81]:[E218 15:08:12.611043997 ProcessGroupNCCL.cpp:616] [Rank 81] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. 
[rank29]:[E218 15:08:12.478582518 ProcessGroupNCCL.cpp:616] [Rank 29] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600094 milliseconds before timing out. [rank101]:[E218 15:08:12.020974360 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 101] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank119]:[E218 15:08:12.811305870 ProcessGroupNCCL.cpp:616] [Rank 119] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600022 milliseconds before timing out. [rank63]:[E218 15:08:12.633639450 ProcessGroupNCCL.cpp:616] [Rank 63] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600064 milliseconds before timing out. [rank9]:[E218 15:08:12.891455956 ProcessGroupNCCL.cpp:616] [Rank 9] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600074 milliseconds before timing out. [rank51]:[E218 15:08:12.747826338 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 51] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank10]:[E218 15:08:12.897764493 ProcessGroupNCCL.cpp:616] [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. [rank76]:[E218 15:08:12.042557058 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 76] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank105]:[E218 15:08:12.940819489 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 105] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank36]:[E218 15:08:12.267064922 ProcessGroupNCCL.cpp:616] [Rank 36] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600062 milliseconds before timing out. [rank94]:[E218 15:08:12.159335149 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 94] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank100]:[E218 15:08:12.020995901 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 100] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank83]:[E218 15:08:12.716103008 ProcessGroupNCCL.cpp:616] [Rank 83] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600088 milliseconds before timing out. [rank23]:[E218 15:08:12.719929480 ProcessGroupNCCL.cpp:616] [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. 
[rank111]:[E218 15:08:12.946865893 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 111] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank71]:[E218 15:08:12.851872547 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 71] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank113]:[E218 15:08:12.066744701 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 113] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank117]:[E218 15:08:12.100449076 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 117] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank53]:[E218 15:08:12.761063008 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 53] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank37]:[E218 15:08:12.514459038 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 37] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank72]:[E218 15:08:12.871283252 ProcessGroupNCCL.cpp:616] [Rank 72] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600056 milliseconds before timing out. [rank92]:[E218 15:08:12.900515062 ProcessGroupNCCL.cpp:616] [Rank 92] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. [rank68]:[E218 15:08:12.859018561 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 68] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank107]:[E218 15:08:12.940822989 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 107] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank98]:[E218 15:08:12.801757444 ProcessGroupNCCL.cpp:616] [Rank 98] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600065 milliseconds before timing out. [rank56]:[E218 15:08:12.913525710 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 56] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank69]:[E218 15:08:12.863024066 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 69] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank61]:[E218 15:08:12.913525670 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 61] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. 
[rank38]:[E218 15:08:12.545390623 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 38] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank102]:[E218 15:08:12.039213831 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 102] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank27]:[E218 15:08:12.688503978 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 27] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank110]:[E218 15:08:12.961792201 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 110] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank30]:[E218 15:08:12.692294804 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 30] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank46]:[E218 15:08:12.427045212 ProcessGroupNCCL.cpp:616] [Rank 46] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. [rank59]:[E218 15:08:12.646547451 ProcessGroupNCCL.cpp:616] [Rank 59] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600093 milliseconds before timing out. [rank77]:[E218 15:08:12.070485872 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 77] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank66]:[E218 15:08:12.877092060 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 66] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank116]:[E218 15:08:12.905629226 ProcessGroupNCCL.cpp:616] [Rank 116] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600070 milliseconds before timing out. [rank55]:[E218 15:08:12.782283798 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 55] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank16]:[E218 15:08:12.785303195 ProcessGroupNCCL.cpp:616] [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. [rank15]:[E218 15:08:12.167823424 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 15] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank26]:[E218 15:08:12.710258659 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 26] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. 
[rank39]:[E218 15:08:12.570728950 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 39] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank108]:[E218 15:08:12.989199286 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 108] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank43]:[E218 15:08:12.640513058 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 43] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank87]:[E218 15:08:12.910265911 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 87] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank82]:[E218 15:08:12.883214239 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 82] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank91]:[E218 15:08:12.194525223 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 91] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank121]:[E218 15:08:12.757744783 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 121] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank54]:[E218 15:08:12.780936118 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 54] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank57]:[E218 15:08:12.946511291 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 57] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank8]:[E218 15:08:12.201334291 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 8] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank24]:[E218 15:08:12.754354376 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 24] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank127]:[E218 15:08:12.779915061 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 127] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank109]:[E218 15:08:12.973216587 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 109] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank58]:[E218 15:08:12.953413512 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 58] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank62]:[E218 15:08:12.954845508 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 62] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. 
[rank42]:[E218 15:08:12.641139919 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 42] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank22]:[E218 15:08:12.799246860 ProcessGroupNCCL.cpp:616] [Rank 22] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. [rank115]:[E218 15:08:12.117273432 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 115] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank97]:[E218 15:08:12.830087981 ProcessGroupNCCL.cpp:616] [Rank 97] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600086 milliseconds before timing out. [rank9]:[E218 15:08:12.209504703 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 9] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank126]:[E218 15:08:12.788533381 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 126] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank96]:[E218 15:08:12.116983043 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 96] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank60]:[E218 15:08:12.959152825 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 60] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank17]:[E218 15:08:12.705705729 ProcessGroupNCCL.cpp:616] [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. [rank85]:[E218 15:08:12.929967097 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 85] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank106]:[E218 15:08:12.042848330 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 106] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank41]:[E218 15:08:12.664969930 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 41] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank52]:[E218 15:08:12.785314024 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 52] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank103]:[E218 15:08:12.072959104 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 103] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. 
[rank48]:[E218 15:08:12.820941281 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 48] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank18]:[E218 15:08:12.962083809 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 18] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank74]:[E218 15:08:12.116996105 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 74] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank123]:[E218 15:08:12.801825996 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 123] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank125]:[E218 15:08:12.799435353 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 125] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank75]:[E218 15:08:12.119984775 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 75] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank90]:[E218 15:08:12.223283851 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 90] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank104]:[E218 15:08:12.029667057 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 104] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank20]:[E218 15:08:12.726368936 ProcessGroupNCCL.cpp:616] [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600058 milliseconds before timing out. [rank119]:[E218 15:08:12.150444927 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 119] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank19]:[E218 15:08:12.926739665 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 19] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank11]:[E218 15:08:12.224567154 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 11] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank16]:[E218 15:08:12.015802290 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 16] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank63]:[E218 15:08:12.993091479 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 63] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank83]:[E218 15:08:12.962109887 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 83] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. 
[rank72]:[E218 15:08:12.154839878 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 72] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank92]:[E218 15:08:12.254855414 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 92] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank98]:[E218 15:08:12.132091058 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 98] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank10]:[E218 15:08:12.242866229 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 10] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank81]:[E218 15:08:12.956331525 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 81] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank116]:[E218 15:08:12.186173206 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 116] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank46]:[E218 15:08:12.724128447 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 46] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank99]:[E218 15:08:12.144806999 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 99] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank36]:[E218 15:08:12.626523295 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 36] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank29]:[E218 15:08:12.788132953 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 29] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank49]:[E218 15:08:12.875875239 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 49] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank59]:[E218 15:08:12.024597393 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 59] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank25]:[E218 15:08:12.800393903 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 25] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank21]:[E218 15:08:12.027302616 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 21] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank23]:[E218 15:08:12.039356433 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 23] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. 
[rank20]:[E218 15:08:12.042083008 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 20] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank22]:[E218 15:08:12.045295991 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 22] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank97]:[E218 15:08:12.177547603 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 97] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank17]:[E218 15:08:12.048589619 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 17] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank31]:[E218 15:08:14.942412459 ProcessGroupNCCL.cpp:616] [Rank 31] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600041 milliseconds before timing out. [rank31]:[E218 15:08:14.254680649 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 31] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank73]:[E218 15:08:15.181598415 ProcessGroupNCCL.cpp:616] [Rank 73] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600033 milliseconds before timing out. [rank73]:[E218 15:08:15.496723006 ProcessGroupNCCL.cpp:1785] [PG ID 1 PG GUID 1 Rank 73] Exception (either an error or timeout) detected by watchdog at work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank63]:[E218 15:08:20.118882385 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 63] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank61]:[E218 15:08:20.118882895 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 61] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank63]:[E218 15:08:20.118910986 ProcessGroupNCCL.cpp:630] [Rank 63] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank61]:[E218 15:08:20.118914136 ProcessGroupNCCL.cpp:630] [Rank 61] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank63]:[E218 15:08:20.118919656 ProcessGroupNCCL.cpp:636] [Rank 63] To avoid data inconsistency, we are taking the entire process down. [rank61]:[E218 15:08:20.118922956 ProcessGroupNCCL.cpp:636] [Rank 61] To avoid data inconsistency, we are taking the entire process down. [rank125]:[E218 15:08:20.969243403 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 125] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank127]:[E218 15:08:20.969252804 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 127] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank125]:[E218 15:08:20.969277284 ProcessGroupNCCL.cpp:630] [Rank 125] Some NCCL operations have failed or timed out. 
Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank125]:[E218 15:08:20.969285454 ProcessGroupNCCL.cpp:636] [Rank 125] To avoid data inconsistency, we are taking the entire process down. [rank127]:[E218 15:08:20.969283664 ProcessGroupNCCL.cpp:630] [Rank 127] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank127]:[E218 15:08:20.969291904 ProcessGroupNCCL.cpp:636] [Rank 127] To avoid data inconsistency, we are taking the entire process down. [rank57]:[E218 15:08:20.162113575 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 57] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank57]:[E218 15:08:20.162143086 ProcessGroupNCCL.cpp:630] [Rank 57] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank57]:[E218 15:08:20.162149486 ProcessGroupNCCL.cpp:636] [Rank 57] To avoid data inconsistency, we are taking the entire process down. [rank121]:[E218 15:08:20.008674579 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 121] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank121]:[E218 15:08:20.008715860 ProcessGroupNCCL.cpp:630] [Rank 121] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank121]:[E218 15:08:20.008723860 ProcessGroupNCCL.cpp:636] [Rank 121] To avoid data inconsistency, we are taking the entire process down. [rank63]:[E218 15:08:20.186092538 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 63] Process group watchdog thread terminated with exception: [Rank 63] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600064 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x74bf19db9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x74becf02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x74becf031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x74becf03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x74bf1a8665c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x74bf1e694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x74bf1e726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank61]:[E218 15:08:20.186088618 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 61] Process group watchdog thread terminated with exception: [Rank 61] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600072 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x712adc793446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x712a91a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x712a91a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x712a91a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x712adc8ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x712ae1094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x712ae1126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank57]:[E218 15:08:20.186093498 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 57] Process group watchdog thread terminated with exception: [Rank 57] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600048 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x791c65350446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x791c1a62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x791c1a631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x791c1a63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x791c654ab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x791c69c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x791c69d26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' terminate called after throwing an instance of 'c10::DistBackendError' terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 61] Process group watchdog thread terminated with exception: [Rank 61] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600072 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x712adc793446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x712a91a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x712a91a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x712a91a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x712adc8ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x712ae1094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x712ae1126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x712adc793446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x712a916a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x712adc8ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x712ae1094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x712ae1126850 in /lib/x86_64-linux-gnu/libc.so.6) what(): what(): [PG ID 1 PG GUID 1 Rank 57] Process group watchdog thread terminated with exception: [Rank 57] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600048 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x791c65350446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x791c1a62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x791c1a631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x791c1a63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x791c654ab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x791c69c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x791c69d26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x791c65350446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x791c1a2a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x791c654ab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x791c69c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x791c69d26850 in /lib/x86_64-linux-gnu/libc.so.6) [PG ID 1 PG GUID 1 Rank 63] Process group watchdog thread terminated with exception: [Rank 63] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600064 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x74bf19db9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x74becf02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x74becf031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x74becf03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x74bf1a8665c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x74bf1e694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x74bf1e726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x74bf19db9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x74bececa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x74bf1a8665c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x74bf1e694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x74bf1e726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank127]:[E218 15:08:20.028067730 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 127] Process group watchdog thread terminated with exception: [Rank 127] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600035 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c569c0e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7c565142a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c5651431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c565143361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7c569c8735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7c56a0a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7c56a0b26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank121]:[E218 15:08:20.028078670 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 121] Process group watchdog thread terminated with exception: [Rank 121] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c2cb6f6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7c2c6c62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c2c6c631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c2c6c63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7c2cb7c5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) [rank10]:[E218 15:08:20.449732576 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 10] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank8]:[E218 15:08:20.449737997 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 8] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank8]:[E218 15:08:20.449759268 ProcessGroupNCCL.cpp:630] [Rank 8] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank10]:[E218 15:08:20.449759658 ProcessGroupNCCL.cpp:630] [Rank 10] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank8]:[E218 15:08:20.449766259 ProcessGroupNCCL.cpp:636] [Rank 8] To avoid data inconsistency, we are taking the entire process down. frame #5: + 0x94ac3 (0x7c2cbbc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7c2cbbd26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank10]:[E218 15:08:20.449766339 ProcessGroupNCCL.cpp:636] [Rank 10] To avoid data inconsistency, we are taking the entire process down. terminate called after throwing an instance of 'c10::DistBackendError' terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 127] Process group watchdog thread terminated with exception: [Rank 127] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600035 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c569c0e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7c565142a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c5651431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c565143361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7c569c8735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7c56a0a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7c56a0b26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c569c0e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7c56510a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7c569c8735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7c56a0a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7c56a0b26850 in /lib/x86_64-linux-gnu/libc.so.6) what(): [PG ID 1 PG GUID 1 Rank 121] Process group watchdog thread terminated with exception: [Rank 121] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c2cb6f6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7c2c6c62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c2c6c631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c2c6c63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7c2cb7c5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7c2cbbc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7c2cbbd26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c2cb6f6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7c2c6c2a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7c2cb7c5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7c2cbbc94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7c2cbbd26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank125]:[E218 15:08:20.041683082 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 125] Process group watchdog thread terminated with exception: [Rank 125] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600086 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f52ef2db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f52a462a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f52a4631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f52a463361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f52efe585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f52f3c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f52f3d26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 125] Process group watchdog thread terminated with exception: [Rank 125] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600086 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f52ef2db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f52a462a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f52a4631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f52a463361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f52efe585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f52f3c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f52f3d26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f52ef2db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7f52a42a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7f52efe585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7f52f3c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7f52f3d26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank14]:[E218 15:08:20.466306177 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 14] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank14]:[E218 15:08:20.466336449 ProcessGroupNCCL.cpp:630] [Rank 14] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank14]:[E218 15:08:20.466344670 ProcessGroupNCCL.cpp:636] [Rank 14] To avoid data inconsistency, we are taking the entire process down. [rank12]:[E218 15:08:20.495050579 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 12] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank12]:[E218 15:08:20.495072461 ProcessGroupNCCL.cpp:630] [Rank 12] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank12]:[E218 15:08:20.495079331 ProcessGroupNCCL.cpp:636] [Rank 12] To avoid data inconsistency, we are taking the entire process down. [rank93]:[E218 15:08:20.495168856 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 93] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank93]:[E218 15:08:20.495203367 ProcessGroupNCCL.cpp:630] [Rank 93] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank93]:[E218 15:08:20.495211727 ProcessGroupNCCL.cpp:636] [Rank 93] To avoid data inconsistency, we are taking the entire process down. 
[rank123]:[E218 15:08:20.079890701 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 123] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank123]:[E218 15:08:20.079918831 ProcessGroupNCCL.cpp:630] [Rank 123] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank123]:[E218 15:08:20.079925931 ProcessGroupNCCL.cpp:636] [Rank 123] To avoid data inconsistency, we are taking the entire process down. [rank123]:[E218 15:08:20.081858084 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 123] Process group watchdog thread terminated with exception: [Rank 123] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600078 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7230c9576446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72307e82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72307e831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72307e83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7230c9c585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7230cde94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7230cdf26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 123] Process group watchdog thread terminated with exception: [Rank 123] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600078 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7230c9576446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72307e82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72307e831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72307e83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7230c9c585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7230cde94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7230cdf26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7230c9576446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x72307e4a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7230c9c585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7230cde94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7230cdf26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank10]:[E218 15:08:20.505607573 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 10] Process group watchdog thread terminated with exception: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7936e4793446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x793699a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x793699a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x793699a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7936e48ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7936e9094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7936e9126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank8]:[E218 15:08:20.505608994 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 8] Process group watchdog thread terminated with exception: [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7137ebadb446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7137a0e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7137a0e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7137a0e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7137ec6525c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7137f0494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7137f0526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 10] Process group watchdog thread terminated with exception: [Rank 10] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600042 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7936e4793446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x793699a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x793699a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x793699a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7936e48ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7936e9094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7936e9126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7936e4793446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7936996a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7936e48ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7936e9094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7936e9126850 in /lib/x86_64-linux-gnu/libc.so.6) what(): [PG ID 1 PG GUID 1 Rank 8] Process group watchdog thread terminated with exception: [Rank 8] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7137ebadb446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7137a0e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7137a0e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7137a0e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7137ec6525c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7137f0494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7137f0526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7137ebadb446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7137a0aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7137ec6525c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7137f0494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7137f0526850 in /lib/x86_64-linux-gnu/libc.so.6) [rank122]:[E218 15:08:20.088578223 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 122] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank122]:[E218 15:08:20.088602444 ProcessGroupNCCL.cpp:630] [Rank 122] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank122]:[E218 15:08:20.088609564 ProcessGroupNCCL.cpp:636] [Rank 122] To avoid data inconsistency, we are taking the entire process down. [rank120]:[E218 15:08:20.088822179 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 120] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank120]:[E218 15:08:20.088849080 ProcessGroupNCCL.cpp:630] [Rank 120] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank120]:[E218 15:08:20.088855189 ProcessGroupNCCL.cpp:636] [Rank 120] To avoid data inconsistency, we are taking the entire process down. [rank120]:[E218 15:08:20.090840134 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 120] Process group watchdog thread terminated with exception: [Rank 120] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7271138e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7270c8c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7270c8c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7270c8c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7271144585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x727118294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x727118326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 120] Process group watchdog thread terminated with exception: [Rank 120] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7271138e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7270c8c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7270c8c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7270c8c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7271144585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x727118294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x727118326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7271138e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7270c88a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7271144585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x727118294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x727118326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank14]:[E218 15:08:20.518744135 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b807152a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7b802682a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b8026831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b802683361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7b8071c735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7b8075e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7b8075f26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 14] Process group watchdog thread terminated with exception: [Rank 14] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b807152a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7b802682a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b8026831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b802683361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7b8071c735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7b8075e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7b8075f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b807152a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7b80264a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7b8071c735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7b8075e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7b8075f26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank122]:[E218 15:08:20.099983647 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 122] Process group watchdog thread terminated with exception: [Rank 122] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600072 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71cb9df2a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x71cb5322a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x71cb53231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x71cb5323361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x71cb9e6735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x71cba2894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x71cba2926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 122] Process group watchdog thread terminated with exception: [Rank 122] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600072 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71cb9df2a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x71cb5322a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x71cb53231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x71cb5323361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x71cb9e6735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x71cba2894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x71cba2926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71cb9df2a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x71cb52ea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x71cb9e6735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x71cba2894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x71cba2926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank12]:[E218 15:08:20.545640083 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f45e26bc446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f4597a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4597a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f4597a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f45e28175c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f45e7094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f45e7126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 12] Process group watchdog thread terminated with exception: [Rank 12] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600015 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f45e26bc446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f4597a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f4597a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f4597a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f45e28175c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f45e7094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f45e7126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f45e26bc446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7f45976a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7f45e28175c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7f45e7094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7f45e7126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank91]:[E218 15:08:20.552192072 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 91] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank91]:[E218 15:08:20.552219702 ProcessGroupNCCL.cpp:630] [Rank 91] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank91]:[E218 15:08:20.552224612 ProcessGroupNCCL.cpp:636] [Rank 91] To avoid data inconsistency, we are taking the entire process down. [rank93]:[E218 15:08:20.552673641 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 93] Process group watchdog thread terminated with exception: [Rank 93] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600055 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d20cdb50446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7d2082e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7d2082e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7d2082e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7d20cdcab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7d20d2494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7d20d2526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' [rank91]:[E218 15:08:20.554086578 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 91] Process group watchdog thread terminated with exception: [Rank 91] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600084 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77f66ab50446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x77f61fe2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x77f61fe31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x77f61fe3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x77f66acab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x77f66f494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x77f66f526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 93] Process group watchdog thread terminated with exception: [Rank 93] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600055 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d20cdb50446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7d2082e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7d2082e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7d2082e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7d20cdcab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7d20d2494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7d20d2526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d20cdb50446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7d2082aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7d20cdcab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7d20d2494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7d20d2526850 in /lib/x86_64-linux-gnu/libc.so.6) what(): [PG ID 1 PG GUID 1 Rank 91] Process group watchdog thread terminated with exception: [Rank 91] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600084 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77f66ab50446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x77f61fe2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x77f61fe31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x77f61fe3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x77f66acab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x77f66f494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x77f66f526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77f66ab50446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x77f61faa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x77f66acab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x77f66f494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x77f66f526850 in /lib/x86_64-linux-gnu/libc.so.6) [rank59]:[E218 15:08:20.333368142 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 59] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank59]:[E218 15:08:20.333390972 ProcessGroupNCCL.cpp:630] [Rank 59] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank59]:[E218 15:08:20.333398052 ProcessGroupNCCL.cpp:636] [Rank 59] To avoid data inconsistency, we are taking the entire process down. [rank59]:[E218 15:08:20.335343631 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 59] Process group watchdog thread terminated with exception: [Rank 59] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600093 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a5d056e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a5cbaa2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a5cbaa31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a5cbaa3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a5d062585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a5d0a094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a5d0a126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 59] Process group watchdog thread terminated with exception: [Rank 59] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600093 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a5d056e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a5cbaa2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a5cbaa31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a5cbaa3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a5d062585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a5d0a094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a5d0a126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a5d056e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7a5cba6a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7a5d062585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7a5d0a094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7a5d0a126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank56]:[E218 15:08:20.336582621 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 56] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank56]:[E218 15:08:20.336606722 ProcessGroupNCCL.cpp:630] [Rank 56] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank56]:[E218 15:08:20.336613802 ProcessGroupNCCL.cpp:636] [Rank 56] To avoid data inconsistency, we are taking the entire process down. [rank56]:[E218 15:08:20.338443657 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 56] Process group watchdog thread terminated with exception: [Rank 56] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600092 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cfe2156c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7cfdd6c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cfdd6c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7cfdd6c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7cfe2225c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7cfe26294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7cfe26326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 56] Process group watchdog thread terminated with exception: [Rank 56] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600092 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cfe2156c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7cfdd6c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cfdd6c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7cfdd6c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7cfe2225c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7cfe26294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7cfe26326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cfe2156c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7cfdd68a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7cfe2225c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7cfe26294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7cfe26326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank58]:[E218 15:08:20.339092633 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 58] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank58]:[E218 15:08:20.339122684 ProcessGroupNCCL.cpp:630] [Rank 58] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank58]:[E218 15:08:20.339129844 ProcessGroupNCCL.cpp:636] [Rank 58] To avoid data inconsistency, we are taking the entire process down. [rank60]:[E218 15:08:20.339306939 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 60] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank60]:[E218 15:08:20.339323729 ProcessGroupNCCL.cpp:630] [Rank 60] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank60]:[E218 15:08:20.339328099 ProcessGroupNCCL.cpp:636] [Rank 60] To avoid data inconsistency, we are taking the entire process down. [rank95]:[E218 15:08:20.589379642 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 95] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank95]:[E218 15:08:20.589411523 ProcessGroupNCCL.cpp:630] [Rank 95] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank95]:[E218 15:08:20.589431563 ProcessGroupNCCL.cpp:636] [Rank 95] To avoid data inconsistency, we are taking the entire process down. 
[rank62]:[E218 15:08:20.339865412 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 62] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank62]:[E218 15:08:20.339883503 ProcessGroupNCCL.cpp:630] [Rank 62] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank62]:[E218 15:08:20.339889543 ProcessGroupNCCL.cpp:636] [Rank 62] To avoid data inconsistency, we are taking the entire process down. [rank58]:[E218 15:08:20.341168265 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 58] Process group watchdog thread terminated with exception: [Rank 58] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e1de9d6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e1d9f42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e1d9f431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e1d9f43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e1dea15e5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e1dee894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e1dee926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank60]:[E218 15:08:20.341167675 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 60] Process group watchdog thread terminated with exception: [Rank 60] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600088 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70a9aed2a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x70a96402a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x70a964031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70a96403361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x70a9af46d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x70a9b3694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x70a9b3726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 58] Process group watchdog thread terminated with exception: [Rank 58] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e1de9d6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7e1d9f42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7e1d9f431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7e1d9f43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7e1dea15e5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7e1dee894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7e1dee926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7e1de9d6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7e1d9f0a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7e1dea15e5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7e1dee894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7e1dee926850 in /lib/x86_64-linux-gnu/libc.so.6) what(): [PG ID 1 PG GUID 1 Rank 60] Process group watchdog thread terminated with exception: [Rank 60] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600088 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70a9aed2a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x70a96402a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x70a964031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70a96403361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x70a9af46d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x70a9b3694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x70a9b3726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70a9aed2a446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x70a963ca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x70a9af46d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x70a9b3694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x70a9b3726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank62]:[E218 15:08:20.352223108 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 62] Process group watchdog thread terminated with exception: [Rank 62] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600096 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x788ae76bc446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x788a9ca2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x788a9ca31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x788a9ca3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x788ae78175c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x788aec094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x788aec126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 62] Process group watchdog thread terminated with exception: [Rank 62] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600096 milliseconds before timing out. 
The what() message is followed by the same checkTimeout and ncclCommWatchdog backtraces, and every remaining rank then fails in the identical way on the same collective, WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000), all within the same second (E218 15:08:20).

Ranks 9, 11, 13, 15, 34, 36, 38, 43, 45, 88, 89, 90, 92, 94, 124, and 126 each log "Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204", the warning "Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.", and "To avoid data inconsistency, we are taking the entire process down."

Each of these ranks' watchdog threads then terminates with a c10::DistBackendError carrying the same checkTimeout / ncclCommWatchdog backtrace shown above (only the library load addresses differ per process); the elapsed time before the timeout fired is:

Rank 124: 600027 ms    Rank 126: 600056 ms    Rank 95: 600090 ms    Rank 15: 600012 ms
Rank 9:   600074 ms    Rank 11:  600051 ms    Rank 13: 600041 ms    Rank 89: 600038 ms
Rank 43:  600069 ms    Rank 45:  600035 ms    Rank 92: 600042 ms    Rank 94: 600025 ms
Rank 90:  600097 ms    Rank 88:  600043 ms    Rank 36: 600062 ms    Rank 38: 600017 ms
Rank 34:  600023 ms
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75a183cb7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75a13902a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75a139031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75a13903361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x75a183e125c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75a188694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75a188726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 34] Process group watchdog thread terminated with exception: [Rank 34] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75a183cb7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x75a13902a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x75a139031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x75a13903361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x75a183e125c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x75a188694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x75a188726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x75a183cb7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x75a138ca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x75a183e125c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x75a188694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x75a188726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank47]:[E218 15:08:20.292924072 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 47] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank47]:[E218 15:08:20.292948462 ProcessGroupNCCL.cpp:630] [Rank 47] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank47]:[E218 15:08:20.292954372 ProcessGroupNCCL.cpp:636] [Rank 47] To avoid data inconsistency, we are taking the entire process down. [rank66]:[E218 15:08:20.542870764 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 66] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank66]:[E218 15:08:20.542899105 ProcessGroupNCCL.cpp:630] [Rank 66] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank66]:[E218 15:08:20.542905595 ProcessGroupNCCL.cpp:636] [Rank 66] To avoid data inconsistency, we are taking the entire process down. [rank47]:[E218 15:08:20.294808651 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 47] Process group watchdog thread terminated with exception: [Rank 47] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x740a94db9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x740a4a02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x740a4a031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x740a4a03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x740a9586c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x740a99694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x740a99726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 47] Process group watchdog thread terminated with exception: [Rank 47] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600076 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x740a94db9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x740a4a02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x740a4a031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x740a4a03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x740a9586c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x740a99694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x740a99726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x740a94db9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x740a49ca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x740a9586c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x740a99694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x740a99726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank41]:[E218 15:08:21.319855730 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 41] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank41]:[E218 15:08:21.319882131 ProcessGroupNCCL.cpp:630] [Rank 41] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank41]:[E218 15:08:21.319889351 ProcessGroupNCCL.cpp:636] [Rank 41] To avoid data inconsistency, we are taking the entire process down. [rank41]:[E218 15:08:21.321753310 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 41] Process group watchdog thread terminated with exception: [Rank 41] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600065 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x729a78776446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x729a2da2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x729a2da31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x729a2da3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x729a792555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x729a7d094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x729a7d126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 41] Process group watchdog thread terminated with exception: [Rank 41] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600065 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x729a78776446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x729a2da2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x729a2da31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x729a2da3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x729a792555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x729a7d094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x729a7d126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x729a78776446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x729a2d6a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x729a792555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x729a7d094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x729a7d126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank33]:[E218 15:08:21.252584960 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 33] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank33]:[E218 15:08:21.252615450 ProcessGroupNCCL.cpp:630] [Rank 33] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank33]:[E218 15:08:21.252624320 ProcessGroupNCCL.cpp:636] [Rank 33] To avoid data inconsistency, we are taking the entire process down. [rank33]:[E218 15:08:21.254551514 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 33] Process group watchdog thread terminated with exception: [Rank 33] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600008 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b1564cdb446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7b151a02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b151a031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b151a03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7b15658585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7b1569694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7b1569726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 33] Process group watchdog thread terminated with exception: [Rank 33] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600008 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b1564cdb446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7b151a02a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7b151a031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7b151a03361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7b15658585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7b1569694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7b1569726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7b1564cdb446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7b1519ca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7b15658585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7b1569694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7b1569726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank66]:[E218 15:08:21.601066563 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 66] Process group watchdog thread terminated with exception: [Rank 66] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71b2bc793446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x71b271a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x71b271a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x71b271a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x71b2bc8ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x71b2c1094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x71b2c1126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank46]:[E218 15:08:21.352022330 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 46] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank46]:[E218 15:08:21.352053690 ProcessGroupNCCL.cpp:630] [Rank 46] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank46]:[E218 15:08:21.352061661 ProcessGroupNCCL.cpp:636] [Rank 46] To avoid data inconsistency, we are taking the entire process down. terminate called after throwing an instance of 'c10::DistBackendError' [rank46]:[E218 15:08:21.353929660 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 46] Process group watchdog thread terminated with exception: [Rank 46] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73c2b6376446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73c26b62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73c26b631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73c26b63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x73c2b6e555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73c2bac94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73c2bad26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 66] Process group watchdog thread terminated with exception: [Rank 66] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600007 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71b2bc793446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x71b271a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x71b271a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) what(): [PG ID 1 PG GUID 1 Rank 46] Process group watchdog thread terminated with exception: [Rank 46] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600060 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73c2b6376446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73c26b62a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73c26b631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x71b271a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x71b2bc8ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x71b2c1094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x71b2c1126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71b2bc793446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x71b2716a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73c26b63361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x73c2b6e555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73c2bac94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73c2bad26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73c2b6376446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x73c26b2a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x71b2bc8ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x71b2c1094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x71b2c1126850 in /lib/x86_64-linux-gnu/libc.so.6) frame #2: + 0x145c0 (0x73c2b6e555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x73c2bac94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x73c2bad26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank100]:[E218 15:08:21.792878633 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 100] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank100]:[E218 15:08:21.792913044 ProcessGroupNCCL.cpp:630] [Rank 100] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank100]:[E218 15:08:21.792921874 ProcessGroupNCCL.cpp:636] [Rank 100] To avoid data inconsistency, we are taking the entire process down. [rank35]:[E218 15:08:21.300476127 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 35] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. 
[rank35]:[E218 15:08:21.300494988 ProcessGroupNCCL.cpp:630] [Rank 35] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank35]:[E218 15:08:21.300499977 ProcessGroupNCCL.cpp:636] [Rank 35] To avoid data inconsistency, we are taking the entire process down. [rank37]:[E218 15:08:21.301918242 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 37] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank37]:[E218 15:08:21.301937842 ProcessGroupNCCL.cpp:630] [Rank 37] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank37]:[E218 15:08:21.301946902 ProcessGroupNCCL.cpp:636] [Rank 37] To avoid data inconsistency, we are taking the entire process down. [rank35]:[E218 15:08:21.302276268 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 35] Process group watchdog thread terminated with exception: [Rank 35] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7003fcaa3446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7003b1e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7003b1e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7003b1e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7003fd45c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x700401494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x700401526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' [rank64]:[E218 15:08:21.621974994 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 64] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank64]:[E218 15:08:21.622027755 ProcessGroupNCCL.cpp:630] [Rank 64] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank64]:[E218 15:08:21.622035905 ProcessGroupNCCL.cpp:636] [Rank 64] To avoid data inconsistency, we are taking the entire process down. what(): [PG ID 1 PG GUID 1 Rank 35] Process group watchdog thread terminated with exception: [Rank 35] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7003fcaa3446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7003b1e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7003b1e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7003b1e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7003fd45c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x700401494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x700401526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7003fcaa3446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7003b1aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7003fd45c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x700401494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x700401526850 in /lib/x86_64-linux-gnu/libc.so.6) [rank32]:[E218 15:08:21.303683443 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 32] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank37]:[E218 15:08:21.303694263 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 37] Process group watchdog thread terminated with exception: [Rank 37] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600009 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73d4e016c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73d49582a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73d495831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73d49583361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x73d4e055e5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73d4e4c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73d4e4d26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank32]:[E218 15:08:21.303703963 ProcessGroupNCCL.cpp:630] [Rank 32] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank32]:[E218 15:08:21.303710053 ProcessGroupNCCL.cpp:636] [Rank 32] To avoid data inconsistency, we are taking the entire process down. terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 37] Process group watchdog thread terminated with exception: [Rank 37] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600009 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73d4e016c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73d49582a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73d495831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73d49583361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x73d4e055e5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73d4e4c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73d4e4d26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73d4e016c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x73d4954a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x73d4e055e5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x73d4e4c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x73d4e4d26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank64]:[E218 15:08:21.623852089 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 64] Process group watchdog thread terminated with exception: [Rank 64] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600028 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77b61c2db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x77b5d162a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x77b5d1631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x77b5d163361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x77b61ce585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x77b620c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x77b620d26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 64] Process group watchdog thread terminated with exception: [Rank 64] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600028 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77b61c2db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x77b5d162a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x77b5d1631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x77b5d163361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x77b61ce585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x77b620c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x77b620d26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x77b61c2db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x77b5d12a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x77b61ce585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x77b620c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x77b620d26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank39]:[E218 15:08:21.309423382 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 39] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank39]:[E218 15:08:21.309437602 ProcessGroupNCCL.cpp:630] [Rank 39] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank39]:[E218 15:08:21.309441622 ProcessGroupNCCL.cpp:636] [Rank 39] To avoid data inconsistency, we are taking the entire process down. [rank39]:[E218 15:08:21.311174832 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 39] Process group watchdog thread terminated with exception: [Rank 39] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7312de56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x731293c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x731293c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x731293c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7312de9e45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7312e3294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7312e3326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 39] Process group watchdog thread terminated with exception: [Rank 39] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600023 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7312de56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x731293c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x731293c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x731293c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7312de9e45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7312e3294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7312e3326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7312de56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7312938a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7312de9e45c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7312e3294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7312e3326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank42]:[E218 15:08:21.405198436 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 42] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank42]:[E218 15:08:21.405220057 ProcessGroupNCCL.cpp:630] [Rank 42] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank42]:[E218 15:08:21.405226367 ProcessGroupNCCL.cpp:636] [Rank 42] To avoid data inconsistency, we are taking the entire process down. [rank44]:[E218 15:08:21.406432226 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 44] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank44]:[E218 15:08:21.406451216 ProcessGroupNCCL.cpp:630] [Rank 44] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank44]:[E218 15:08:21.406458066 ProcessGroupNCCL.cpp:636] [Rank 44] To avoid data inconsistency, we are taking the entire process down. [rank42]:[E218 15:08:21.407095706 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 42] Process group watchdog thread terminated with exception: [Rank 42] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600096 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73e0facdb446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73e0b002a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73e0b0031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73e0b003361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x73e0fb8585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73e0ff694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73e0ff726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 42] Process group watchdog thread terminated with exception: [Rank 42] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600096 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73e0facdb446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73e0b002a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73e0b0031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73e0b003361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x73e0fb8585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73e0ff694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73e0ff726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73e0facdb446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x73e0afca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x73e0fb8585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x73e0ff694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x73e0ff726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank44]:[E218 15:08:21.408274254 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 44] Process group watchdog thread terminated with exception: [Rank 44] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600010 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x749fa276c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x749f57e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x749f57e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x749f57e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x749fa38565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x749fa7494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x749fa7526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 44] Process group watchdog thread terminated with exception: [Rank 44] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600010 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x749fa276c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x749f57e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x749f57e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x749f57e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x749fa38565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x749fa7494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x749fa7526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x749fa276c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x749f57aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x749fa38565c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x749fa7494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x749fa7526850 in /lib/x86_64-linux-gnu/libc.so.6) [rank40]:[E218 15:08:21.412846335 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 40] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank40]:[E218 15:08:21.412871926 ProcessGroupNCCL.cpp:630] [Rank 40] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank40]:[E218 15:08:21.412878596 ProcessGroupNCCL.cpp:636] [Rank 40] To avoid data inconsistency, we are taking the entire process down.
[rank102]:[E218 15:08:21.842348441 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 102] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank102]:[E218 15:08:21.842365192 ProcessGroupNCCL.cpp:630] [Rank 102] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank102]:[E218 15:08:21.842369402 ProcessGroupNCCL.cpp:636] [Rank 102] To avoid data inconsistency, we are taking the entire process down.
[rank68]:[E218 15:08:21.667991715 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 68] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank68]:[E218 15:08:21.668036125 ProcessGroupNCCL.cpp:630] [Rank 68] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank68]:[E218 15:08:21.668043506 ProcessGroupNCCL.cpp:636] [Rank 68] To avoid data inconsistency, we are taking the entire process down.
[rank68]:[E218 15:08:21.669857239 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 68] Process group watchdog thread terminated with exception: [Rank 68] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600034 milliseconds before timing out.
[rank100]:[E218 15:08:21.852131329 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 100] Process group watchdog thread terminated with exception: [Rank 100] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
[rank32]:[E218 15:08:21.360735829 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 32] Process group watchdog thread terminated with exception: [Rank 32] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600046 milliseconds before timing out.
[rank102]:[E218 15:08:21.861117472 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 102] Process group watchdog thread terminated with exception: [Rank 102] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank70]:[E218 15:08:21.693780847 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 70] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank70]:[E218 15:08:21.693810827 ProcessGroupNCCL.cpp:630] [Rank 70] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank70]:[E218 15:08:21.693817868 ProcessGroupNCCL.cpp:636] [Rank 70] To avoid data inconsistency, we are taking the entire process down.
[rank70]:[E218 15:08:21.695709833 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 70] Process group watchdog thread terminated with exception: [Rank 70] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank40]:[E218 15:08:21.467683958 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 40] Process group watchdog thread terminated with exception: [Rank 40] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600007 milliseconds before timing out.
[rank96]:[E218 15:08:21.901097798 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 96] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank96]:[E218 15:08:21.901130389 ProcessGroupNCCL.cpp:630] [Rank 96] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank96]:[E218 15:08:21.901137819 ProcessGroupNCCL.cpp:636] [Rank 96] To avoid data inconsistency, we are taking the entire process down.
[rank98]:[E218 15:08:21.908925750 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 98] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank98]:[E218 15:08:21.908957101 ProcessGroupNCCL.cpp:630] [Rank 98] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank98]:[E218 15:08:21.908964861 ProcessGroupNCCL.cpp:636] [Rank 98] To avoid data inconsistency, we are taking the entire process down.
[rank98]:[E218 15:08:21.910903166 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 98] Process group watchdog thread terminated with exception: [Rank 98] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600065 milliseconds before timing out.
[rank96]:[E218 15:08:21.956502514 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 96] Process group watchdog thread terminated with exception: [Rank 96] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600081 milliseconds before timing out.
[rank69]:[E218 15:08:21.864027671 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 69] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank69]:[E218 15:08:21.864052402 ProcessGroupNCCL.cpp:630] [Rank 69] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank69]:[E218 15:08:21.864059532 ProcessGroupNCCL.cpp:636] [Rank 69] To avoid data inconsistency, we are taking the entire process down.
[rank71]:[E218 15:08:21.864101733 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 71] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank71]:[E218 15:08:21.864116643 ProcessGroupNCCL.cpp:630] [Rank 71] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank71]:[E218 15:08:21.864121193 ProcessGroupNCCL.cpp:636] [Rank 71] To avoid data inconsistency, we are taking the entire process down.
[rank67]:[E218 15:08:21.864484640 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 67] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank67]:[E218 15:08:21.864504600 ProcessGroupNCCL.cpp:630] [Rank 67] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank67]:[E218 15:08:21.864510470 ProcessGroupNCCL.cpp:636] [Rank 67] To avoid data inconsistency, we are taking the entire process down.
[rank65]:[E218 15:08:21.865730283 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 65] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank65]:[E218 15:08:21.865755323 ProcessGroupNCCL.cpp:630] [Rank 65] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank65]:[E218 15:08:21.865762854 ProcessGroupNCCL.cpp:636] [Rank 65] To avoid data inconsistency, we are taking the entire process down.
[rank71]:[E218 15:08:21.865919646 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 71] Process group watchdog thread terminated with exception: [Rank 71] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600034 milliseconds before timing out.
[rank65]:[E218 15:08:21.867686450 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 65] Process group watchdog thread terminated with exception: [Rank 65] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600040 milliseconds before timing out.
[rank99]:[E218 15:08:21.050102324 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 99] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank99]:[E218 15:08:21.050125804 ProcessGroupNCCL.cpp:630] [Rank 99] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank99]:[E218 15:08:21.050130664 ProcessGroupNCCL.cpp:636] [Rank 99] To avoid data inconsistency, we are taking the entire process down.
[rank67]:[E218 15:08:21.875238121 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 67] Process group watchdog thread terminated with exception: [Rank 67] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600029 milliseconds before timing out.
[rank99]:[E218 15:08:21.052048019 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 99] Process group watchdog thread terminated with exception: [Rank 99] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
[rank69]:[E218 15:08:21.883616968 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 69] Process group watchdog thread terminated with exception: [Rank 69] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600032 milliseconds before timing out.
[rank101]:[E218 15:08:21.078931727 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 101] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank101]:[E218 15:08:21.078954557 ProcessGroupNCCL.cpp:630] [Rank 101] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank101]:[E218 15:08:21.078961147 ProcessGroupNCCL.cpp:636] [Rank 101] To avoid data inconsistency, we are taking the entire process down.
[rank101]:[E218 15:08:21.080856882 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 101] Process group watchdog thread terminated with exception: [Rank 101] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7dc5dbf76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7dc59122a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7dc591231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7dc59123361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7dc5dca555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7dc5e0894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7dc5e0926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7dc5dbf76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7dc590ea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7dc5dca555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7dc5e0894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7dc5e0926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank103]:[E218 15:08:21.082908049 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 103] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank103]:[E218 15:08:21.082933349 ProcessGroupNCCL.cpp:630] [Rank 103] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank103]:[E218 15:08:21.082940340 ProcessGroupNCCL.cpp:636] [Rank 103] To avoid data inconsistency, we are taking the entire process down. [rank103]:[E218 15:08:21.084779093 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 103] Process group watchdog thread terminated with exception: [Rank 103] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600044 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7021bc993446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x702171c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x702171c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x702171c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x7021bcaee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x7021c1294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x7021c1326850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 1 PG GUID 1 Rank 103] Process group watchdog thread terminated with exception: [Rank 103] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600044 milliseconds before timing out.
  [checkTimeout traceback identical to the one above, followed by:]
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7021bc993446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x7021718a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x7021bcaee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x7021c1294ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x7021c1326850 in /lib/x86_64-linux-gnu/libc.so.6)
[rank97]:[E218 15:08:21.089860575 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 97] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank97]:[E218 15:08:21.089886746 ProcessGroupNCCL.cpp:630] [Rank 97] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank97]:[E218 15:08:21.089893816 ProcessGroupNCCL.cpp:636] [Rank 97] To avoid data inconsistency, we are taking the entire process down.
[rank97]:[E218 15:08:21.091797601 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 97] Process group watchdog thread terminated with exception: [Rank 97] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 1 PG GUID 1 Rank 97] Process group watchdog thread terminated with exception: [Rank 97] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
  [checkTimeout and ncclCommWatchdog tracebacks identical to Rank 103's above]
[rank27]:[E218 15:08:22.848943007 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 27] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204.
[rank27]:[E218 15:08:22.848976789 ProcessGroupNCCL.cpp:630] [Rank 27] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank27]:[E218 15:08:22.848983979 ProcessGroupNCCL.cpp:636] [Rank 27] To avoid data inconsistency, we are taking the entire process down.
[rank27]:[E218 15:08:22.906081423 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 27] Process group watchdog thread terminated with exception: [Rank 27] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600002 milliseconds before timing out.
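For scale, the stuck all-reduce (SeqNum=159205, the operation right after the last completed work 159204 on every rank shown) moves roughly half a billion elements per rank. The short sketch below only turns the NumelIn/NumelOut figure from the watchdog records into bytes; the dtypes are assumptions, since the log reports element counts but not the element type.

```python
# Back-of-the-envelope payload size of the timed-out all-reduce.
# NumelIn / NumelOut come from the WorkNCCL record above; the dtypes
# below are assumptions, not something the log states.
numel = 495_229_180

for name, bytes_per_elem in (("fp32", 4), ("bf16/fp16", 2)):
    gib = numel * bytes_per_elem / 2**30
    print(f"all-reduce payload per rank ({name}): {gib:.2f} GiB")
# fp32      -> ~1.84 GiB per rank
# bf16/fp16 -> ~0.92 GiB per rank
```

Even at a couple of GiB per rank, a healthy interconnect finishes this collective in seconds, so a 600000 ms timeout points at a rank that never entered the collective rather than at slow links.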
[The remaining ranks report the same timeout on SeqNum=159205; each watchdog record below was preceded by the same ProcessGroupNCCL.cpp:1834/630/636 notices and followed by checkTimeout/ncclCommWatchdog tracebacks identical to the ones above and a c10::DistBackendError abort.]
[rank104]:[E218 15:08:22.319823974 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 104] Process group watchdog thread terminated with exception: [Rank 104] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600045 milliseconds before timing out.
[rank29]:[E218 15:08:22.136189547 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 29] Process group watchdog thread terminated with exception: [Rank 29] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600094 milliseconds before timing out.
[rank106]:[E218 15:08:22.485915417 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 106] Process group watchdog thread terminated with exception: [Rank 106] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600099 milliseconds before timing out.
[rank108]:[E218 15:08:22.501128518 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 108] Process group watchdog thread terminated with exception: [Rank 108] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600071 milliseconds before timing out.
[rank110]:[E218 15:08:22.572270734 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 110] Process group watchdog thread terminated with exception: [Rank 110] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600039 milliseconds before timing out.
[rank105]:[E218 15:08:22.572320036 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 105] Process group watchdog thread terminated with exception: [Rank 105] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600019 milliseconds before timing out.
[rank109]:[E218 15:08:22.604958897 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 109] Process group watchdog thread terminated with exception: [Rank 109] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600013 milliseconds before timing out.
[rank20]:[E218 15:08:22.557060489 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 20] Process group watchdog thread terminated with exception: [Rank 20] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600058 milliseconds before timing out.
[rank22]:[E218 15:08:22.557061009 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 22] Process group watchdog thread terminated with exception: [Rank 22] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600063 milliseconds before timing out.
[rank16]:[E218 15:08:22.564783924 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 16] Process group watchdog thread terminated with exception: [Rank 16] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600063 milliseconds before timing out.
[rank51]:[E218 15:08:22.434727160 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 51] Process group watchdog thread terminated with exception: [Rank 51] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600068 milliseconds before timing out.
[rank55]:[E218 15:08:22.434732031 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 55] Process group watchdog thread terminated with exception: [Rank 55] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600044 milliseconds before timing out.
[rank52]:[E218 15:08:22.434737011 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 52] Process group watchdog thread terminated with exception: [Rank 52] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600035 milliseconds before timing out.
[rank50]:[E218 15:08:22.442855603 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 50] Process group watchdog thread terminated with exception: [Rank 50] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600010 milliseconds before timing out.
[rank30]:[E218 15:08:23.384470736 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 30] Process group watchdog thread terminated with exception: [Rank 30] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600086 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ffb93576446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7ffb4882a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ffb48831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ffb4883361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7ffb93c585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7ffb97e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7ffb97f26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 30] Process group watchdog thread terminated with exception: [Rank 30] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600086 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ffb93576446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7ffb4882a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ffb48831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ffb4883361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7ffb93c585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7ffb97e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7ffb97f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ffb93576446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7ffb484a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7ffb93c585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7ffb97e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7ffb97f26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank28]:[E218 15:08:23.388343137 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 28] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank28]:[E218 15:08:23.388372678 ProcessGroupNCCL.cpp:630] [Rank 28] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank28]:[E218 15:08:23.388380719 ProcessGroupNCCL.cpp:636] [Rank 28] To avoid data inconsistency, we are taking the entire process down. [rank28]:[E218 15:08:23.390381503 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 28] Process group watchdog thread terminated with exception: [Rank 28] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600036 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72e407b6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72e3bd22a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72e3bd231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72e3bd23361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72e40885c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72e40c894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72e40c926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 28] Process group watchdog thread terminated with exception: [Rank 28] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600036 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72e407b6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72e3bd22a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72e3bd231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72e3bd23361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72e40885c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72e40c894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72e40c926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72e407b6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x72e3bcea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x72e40885c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x72e40c894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x72e40c926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank23]:[E218 15:08:23.615864076 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 23] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank23]:[E218 15:08:23.615889198 ProcessGroupNCCL.cpp:630] [Rank 23] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank23]:[E218 15:08:23.615896298 ProcessGroupNCCL.cpp:636] [Rank 23] To avoid data inconsistency, we are taking the entire process down. [rank23]:[E218 15:08:23.617727671 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 23] Process group watchdog thread terminated with exception: [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cecd24e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7cec8782a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cec87831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7cec8783361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7cecd2c735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7cecd6e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7cecd6f26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 23] Process group watchdog thread terminated with exception: [Rank 23] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600061 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cecd24e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7cec8782a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7cec87831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7cec8783361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7cecd2c735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7cecd6e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7cecd6f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7cecd24e7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7cec874a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7cecd2c735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7cecd6e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7cecd6f26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank26]:[E218 15:08:23.398324125 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 26] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank26]:[E218 15:08:23.398361687 ProcessGroupNCCL.cpp:630] [Rank 26] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank26]:[E218 15:08:23.398369578 ProcessGroupNCCL.cpp:636] [Rank 26] To avoid data inconsistency, we are taking the entire process down. [rank21]:[E218 15:08:23.618455363 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 21] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank21]:[E218 15:08:23.618476954 ProcessGroupNCCL.cpp:630] [Rank 21] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank21]:[E218 15:08:23.618484424 ProcessGroupNCCL.cpp:636] [Rank 21] To avoid data inconsistency, we are taking the entire process down. [rank21]:[E218 15:08:23.620281986 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 21] Process group watchdog thread terminated with exception: [Rank 21] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79509e16c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x79505382a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x795053831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x79505383361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x79509e5635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7950a2c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7950a2d26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 21] Process group watchdog thread terminated with exception: [Rank 21] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600063 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79509e16c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x79505382a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x795053831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x79505383361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x79509e5635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7950a2c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7950a2d26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x79509e16c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7950534a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x79509e5635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7950a2c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7950a2d26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank26]:[E218 15:08:23.420591384 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 26] Process group watchdog thread terminated with exception: [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600002 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ccdb26c1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7ccd67a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ccd67a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ccd67a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7ccdb281c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7ccdb7094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7ccdb7126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 26] Process group watchdog thread terminated with exception: [Rank 26] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600002 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ccdb26c1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7ccd67a2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7ccd67a31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7ccd67a3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7ccdb281c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7ccdb7094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7ccdb7126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7ccdb26c1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7ccd676a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7ccdb281c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7ccdb7094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7ccdb7126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank87]:[E218 15:08:23.648517394 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 87] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank86]:[E218 15:08:23.648523344 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 86] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank87]:[E218 15:08:23.648545535 ProcessGroupNCCL.cpp:630] [Rank 87] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank86]:[E218 15:08:23.648550205 ProcessGroupNCCL.cpp:630] [Rank 86] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank87]:[E218 15:08:23.648553265 ProcessGroupNCCL.cpp:636] [Rank 87] To avoid data inconsistency, we are taking the entire process down. [rank86]:[E218 15:08:23.648556445 ProcessGroupNCCL.cpp:636] [Rank 86] To avoid data inconsistency, we are taking the entire process down. [rank83]:[E218 15:08:23.668349324 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 83] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank83]:[E218 15:08:23.668378165 ProcessGroupNCCL.cpp:630] [Rank 83] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank83]:[E218 15:08:23.668386685 ProcessGroupNCCL.cpp:636] [Rank 83] To avoid data inconsistency, we are taking the entire process down. 
[rank107]:[E218 15:08:23.754393657 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 107] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank107]:[E218 15:08:23.754418388 ProcessGroupNCCL.cpp:630] [Rank 107] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank107]:[E218 15:08:23.754424329 ProcessGroupNCCL.cpp:636] [Rank 107] To avoid data inconsistency, we are taking the entire process down. [rank107]:[E218 15:08:23.756199665 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 107] Process group watchdog thread terminated with exception: [Rank 107] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600041 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73bd4f1b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73bd0442a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73bd04431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73bd0443361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x73bd4fc6c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73bd53a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73bd53b26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' [rank113]:[E218 15:08:23.871032216 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 113] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank113]:[E218 15:08:23.871062787 ProcessGroupNCCL.cpp:630] [Rank 113] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank112]:[E218 15:08:23.871053787 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 112] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank113]:[E218 15:08:23.871071767 ProcessGroupNCCL.cpp:636] [Rank 113] To avoid data inconsistency, we are taking the entire process down. [rank112]:[E218 15:08:23.871077627 ProcessGroupNCCL.cpp:630] [Rank 112] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank112]:[E218 15:08:23.871084967 ProcessGroupNCCL.cpp:636] [Rank 112] To avoid data inconsistency, we are taking the entire process down. what(): [PG ID 1 PG GUID 1 Rank 107] Process group watchdog thread terminated with exception: [Rank 107] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600041 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73bd4f1b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x73bd0442a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x73bd04431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x73bd0443361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x73bd4fc6c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x73bd53a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x73bd53b26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x73bd4f1b9446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x73bd040a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x73bd4fc6c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x73bd53a94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x73bd53b26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank24]:[E218 15:08:23.491327595 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 24] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank24]:[E218 15:08:23.491358447 ProcessGroupNCCL.cpp:630] [Rank 24] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank24]:[E218 15:08:23.491366997 ProcessGroupNCCL.cpp:636] [Rank 24] To avoid data inconsistency, we are taking the entire process down. [rank111]:[E218 15:08:23.777399248 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 111] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank111]:[E218 15:08:23.777427609 ProcessGroupNCCL.cpp:630] [Rank 111] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank111]:[E218 15:08:23.777436410 ProcessGroupNCCL.cpp:636] [Rank 111] To avoid data inconsistency, we are taking the entire process down. [rank117]:[E218 15:08:23.892829887 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 117] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank117]:[E218 15:08:23.892860568 ProcessGroupNCCL.cpp:630] [Rank 117] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank117]:[E218 15:08:23.892867608 ProcessGroupNCCL.cpp:636] [Rank 117] To avoid data inconsistency, we are taking the entire process down. 
[rank111]:[E218 15:08:23.779720751 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 111] Process group watchdog thread terminated with exception: [Rank 111] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600052 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x744d6d504446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x744d2282a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x744d22831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x744d2283361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x744d6d65f5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x744d71e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x744d71f26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 111] Process group watchdog thread terminated with exception: [Rank 111] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600052 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x744d6d504446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x744d2282a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x744d22831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x744d2283361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x744d6d65f5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x744d71e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x744d71f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x744d6d504446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x744d224a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x744d6d65f5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x744d71e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x744d71f26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank86]:[E218 15:08:23.705685435 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 
Rank 86] Process group watchdog thread terminated with exception: [Rank 86] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600050 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a2c73f50446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a2c2922a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a2c29231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a2c2923361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a2c740ab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a2c78894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a2c78926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank87]:[E218 15:08:23.705682255 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 87] Process group watchdog thread terminated with exception: [Rank 87] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600036 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7af7dbb76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7af790e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7af790e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7af790e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7af7dc6555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7af7e0494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7af7e0526850 in /lib/x86_64-linux-gnu/libc.so.6) [rank83]:[E218 15:08:23.705701195 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 83] Process group watchdog thread terminated with exception: [Rank 83] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600088 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f7ae2ab7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f7a97e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f7a97e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f7a97e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f7ae2c125c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f7ae7494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f7ae7526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' terminate called after throwing an instance of 'c10::DistBackendError' terminate called after throwing an instance of 'c10::DistBackendError' [rank17]:[E218 15:08:23.730999612 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 17] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank17]:[E218 15:08:23.731041924 ProcessGroupNCCL.cpp:630] [Rank 17] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank17]:[E218 15:08:23.731048485 ProcessGroupNCCL.cpp:636] [Rank 17] To avoid data inconsistency, we are taking the entire process down. what(): [PG ID 1 PG GUID 1 Rank 87] Process group watchdog thread terminated with exception: [Rank 87] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600036 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7af7dbb76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7af790e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7af790e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7af790e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7af7dc6555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7af7e0494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7af7e0526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7af7dbb76446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7af790aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7af7dc6555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7af7e0494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7af7e0526850 in /lib/x86_64-linux-gnu/libc.so.6) what(): what(): [PG ID 1 PG GUID 1 Rank 83] Process group watchdog thread terminated with exception: [Rank 83] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600088 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f7ae2ab7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7f7a97e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f7a97e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7f7a97e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7f7ae2c125c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7f7ae7494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7f7ae7526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f7ae2ab7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7f7a97aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7f7ae2c125c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7f7ae7494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7f7ae7526850 in /lib/x86_64-linux-gnu/libc.so.6) [PG ID 1 PG GUID 1 Rank 86] Process group watchdog thread terminated with exception: [Rank 86] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600050 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a2c73f50446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a2c2922a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a2c29231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a2c2923361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a2c740ab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a2c78894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a2c78926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a2c73f50446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7a2c28ea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7a2c740ab5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7a2c78894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7a2c78926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank54]:[E218 15:08:23.593669537 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 54] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank54]:[E218 15:08:23.593697348 ProcessGroupNCCL.cpp:630] [Rank 54] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank54]:[E218 15:08:23.593705268 ProcessGroupNCCL.cpp:636] [Rank 54] To avoid data inconsistency, we are taking the entire process down. [rank17]:[E218 15:08:23.732860667 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 17] Process group watchdog thread terminated with exception: [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71570bf6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7156c162a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7156c1631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7156c163361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x71570d05d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x715710c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x715710d26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 17] Process group watchdog thread terminated with exception: [Rank 17] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600059 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71570bf6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7156c162a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7156c1631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7156c163361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x71570d05d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x715710c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x715710d26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x71570bf6c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7156c12a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x71570d05d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x715710c94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x715710d26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank54]:[E218 15:08:23.595579596 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 54] Process group watchdog thread terminated with exception: [Rank 54] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600016 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x728e1b2db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x728dd062a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x728dd0631bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x728dd063361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x145c0 (0x728e1be585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #5: <unknown function> + 0x94ac3 (0x728e1fc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126850 (0x728e1fd26850 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG ID 1 PG GUID 1 Rank 54] Process group watchdog thread terminated with exception: [Rank 54] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600016 milliseconds before timing out.
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x728e1b2db446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe4271b (0x728dd02a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0x145c0 (0x728e1be585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x94ac3 (0x728e1fc94ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126850 (0x728e1fd26850 in /lib/x86_64-linux-gnu/libc.so.6)

[Every rank below aborted with the same c10::DistBackendError and the same pair of stack traces (checkTimeout at ProcessGroupNCCL.cpp:618, ncclCommWatchdog at ProcessGroupNCCL.cpp:1601); only the library load addresses differ, so the traces are not repeated here. Each rank reports the same stuck collective, WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000), with last enqueued NCCL work 159208 and last completed NCCL work 159204, followed by "Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data." and "To avoid data inconsistency, we are taking the entire process down."]

Per-rank watchdog timeouts (E218 15:08:23, ProcessGroupNCCL.cpp:1595):
  Rank 54  ran for 600016 ms before timing out
  Rank 49  ran for 600090 ms before timing out
  Rank 116 ran for 600070 ms before timing out
  Rank 113 ran for 600028 ms before timing out
  Rank 24  ran for 600092 ms before timing out
  Rank 118 ran for 600025 ms before timing out
  Rank 85  ran for 600030 ms before timing out
  Rank 31  ran for 600041 ms before timing out
  Rank 117 ran for 600065 ms before timing out
  Rank 82  ran for 600039 ms before timing out
  Rank 112 ran for 600019 ms before timing out
  Rank 25  ran for 600040 ms before timing out
  Rank 80  ran for 600010 ms before timing out
  Rank 19  ran for 600052 ms before timing out
  Rank 18  ran for 600063 ms before timing out
  Rank 48  ran for 600077 ms before timing out
  Rank 53  ran for 600053 ms before timing out
  Rank 114 ran for 600069 ms before timing out
  Rank 115 reported the same timeout at NCCL work 159205 (record truncated at the end of this excerpt)
[rank115]:[E218 15:08:23.143325255 ProcessGroupNCCL.cpp:636] [Rank 115] To avoid data inconsistency, we are taking the entire process down. [rank115]:[E218 15:08:23.145130236 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 115] Process group watchdog thread terminated with exception: [Rank 115] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7de89296c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7de84802a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7de848031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7de84803361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7de893a5d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7de897694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7de897726850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 115] Process group watchdog thread terminated with exception: [Rank 115] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600051 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7de89296c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7de84802a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7de848031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7de84803361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7de893a5d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7de897694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7de897726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7de89296c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7de847ca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7de893a5d5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7de897694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7de897726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank81]:[E218 15:08:23.959572673 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 81] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank81]:[E218 15:08:23.959600364 ProcessGroupNCCL.cpp:630] [Rank 81] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank81]:[E218 15:08:23.959607884 ProcessGroupNCCL.cpp:636] [Rank 81] To avoid data inconsistency, we are taking the entire process down. [rank81]:[E218 15:08:23.961566006 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 81] Process group watchdog thread terminated with exception: [Rank 81] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70118c56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x701141c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x701141c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x701141c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x70118c9635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x701191094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x701191126850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 81] Process group watchdog thread terminated with exception: [Rank 81] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600066 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70118c56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x701141c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x701141c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x701141c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x70118c9635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x701191094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x701191126850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70118c56c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7011418a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x70118c9635c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x701191094ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x701191126850 in /lib/x86_64-linux-gnu/libc.so.6) [rank119]:[E218 15:08:23.157516647 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 119] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank119]:[E218 15:08:23.157539777 ProcessGroupNCCL.cpp:630] [Rank 119] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank119]:[E218 15:08:23.157548547 ProcessGroupNCCL.cpp:636] [Rank 119] To avoid data inconsistency, we are taking the entire process down. [rank119]:[E218 15:08:23.159437380 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 119] Process group watchdog thread terminated with exception: [Rank 119] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600022 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72e829ee7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72e7df22a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72e7df231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72e7df23361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72e82a6735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72e82e894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72e82e926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 119] Process group watchdog thread terminated with exception: [Rank 119] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600022 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72e829ee7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x72e7df22a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x72e7df231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x72e7df23361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x72e82a6735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x72e82e894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x72e82e926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x72e829ee7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x72e7deea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x72e82a6735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x72e82e894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x72e82e926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank84]:[E218 15:08:23.013567907 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 84] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank84]:[E218 15:08:23.013597358 ProcessGroupNCCL.cpp:630] [Rank 84] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank84]:[E218 15:08:23.013604888 ProcessGroupNCCL.cpp:636] [Rank 84] To avoid data inconsistency, we are taking the entire process down. [rank84]:[E218 15:08:23.057182461 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 84] Process group watchdog thread terminated with exception: [Rank 84] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7847ed76c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7847a2e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7847a2e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7847a2e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7847ee85c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7847f2494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7847f2526850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 84] Process group watchdog thread terminated with exception: [Rank 84] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600039 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7847ed76c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7847a2e2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7847a2e31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7847a2e3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7847ee85c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7847f2494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7847f2526850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7847ed76c446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7847a2aa071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7847ee85c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7847f2494ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7847f2526850 in /lib/x86_64-linux-gnu/libc.so.6) [rank74]:[E218 15:08:23.709181066 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 74] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank72]:[E218 15:08:23.709183186 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 72] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. 
[rank74]:[E218 15:08:23.709209298 ProcessGroupNCCL.cpp:630] [Rank 74] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank72]:[E218 15:08:23.709209838 ProcessGroupNCCL.cpp:630] [Rank 72] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank74]:[E218 15:08:23.709216678 ProcessGroupNCCL.cpp:636] [Rank 74] To avoid data inconsistency, we are taking the entire process down. [rank72]:[E218 15:08:23.709217368 ProcessGroupNCCL.cpp:636] [Rank 72] To avoid data inconsistency, we are taking the entire process down. [rank78]:[E218 15:08:24.750893514 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 78] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank78]:[E218 15:08:24.750915035 ProcessGroupNCCL.cpp:630] [Rank 78] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank78]:[E218 15:08:24.750919746 ProcessGroupNCCL.cpp:636] [Rank 78] To avoid data inconsistency, we are taking the entire process down. [rank76]:[E218 15:08:24.751893458 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 76] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank76]:[E218 15:08:24.751924489 ProcessGroupNCCL.cpp:630] [Rank 76] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank76]:[E218 15:08:24.751931870 ProcessGroupNCCL.cpp:636] [Rank 76] To avoid data inconsistency, we are taking the entire process down. [rank72]:[E218 15:08:24.769119741 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 72] Process group watchdog thread terminated with exception: [Rank 72] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600056 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb34ace7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7fb30002a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fb300031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fb30003361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7fb34b4735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7fb34f694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7fb34f726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank78]:[E218 15:08:24.769140653 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 78] Process group watchdog thread terminated with exception: [Rank 78] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600093 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7913ea0a3446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x79139f42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x79139f431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x79139f43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7913eaa5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7913eea94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7913eeb26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 78] Process group watchdog thread terminated with exception: [Rank 78] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600093 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7913ea0a3446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x79139f42a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x79139f431bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x79139f43361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7913eaa5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7913eea94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7913eeb26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7913ea0a3446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x79139f0a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7913eaa5c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7913eea94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7913eeb26850 in /lib/x86_64-linux-gnu/libc.so.6) what(): [PG ID 1 PG GUID 1 Rank 72] Process group watchdog thread terminated with exception: [Rank 72] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600056 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb34ace7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7fb30002a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fb300031bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7fb30003361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7fb34b4735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7fb34f694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7fb34f726850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb34ace7446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7fb2ffca071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7fb34b4735c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7fb34f694ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7fb34f726850 in /lib/x86_64-linux-gnu/libc.so.6) [rank74]:[E218 15:08:24.780405508 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 74] Process group watchdog thread terminated with exception: [Rank 74] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c2bc5576446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7c2b7a82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c2b7a831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c2b7a83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7c2bc60555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7c2bc9e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7c2bc9f26850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 74] Process group watchdog thread terminated with exception: [Rank 74] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600069 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c2bc5576446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7c2b7a82a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c2b7a831bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c2b7a83361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7c2bc60555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7c2bc9e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7c2bc9f26850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c2bc5576446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7c2b7a4a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7c2bc60555c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7c2bc9e94ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7c2bc9f26850 in /lib/x86_64-linux-gnu/libc.so.6) [rank76]:[E218 15:08:24.788114057 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 76] Process group watchdog thread terminated with exception: [Rank 76] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d9117f93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7d90cd22a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7d90cd231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7d90cd23361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7d91180ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7d911c894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7d911c926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 76] Process group watchdog thread terminated with exception: [Rank 76] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600030 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d9117f93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7d90cd22a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7d90cd231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7d90cd23361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7d91180ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7d911c894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7d911c926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7d9117f93446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7d90ccea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7d91180ee5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7d911c894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7d911c926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank75]:[E218 15:08:24.927908221 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 75] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank75]:[E218 15:08:24.927939223 ProcessGroupNCCL.cpp:630] [Rank 75] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank75]:[E218 15:08:24.927945374 ProcessGroupNCCL.cpp:636] [Rank 75] To avoid data inconsistency, we are taking the entire process down. [rank77]:[E218 15:08:24.928987680 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 77] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank77]:[E218 15:08:24.929028513 ProcessGroupNCCL.cpp:630] [Rank 77] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank77]:[E218 15:08:24.929035303 ProcessGroupNCCL.cpp:636] [Rank 77] To avoid data inconsistency, we are taking the entire process down. [rank75]:[E218 15:08:24.929829283 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 75] Process group watchdog thread terminated with exception: [Rank 75] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600082 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78c73e976446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x78c6f3c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78c6f3c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x78c6f3c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x78c73f0585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x78c743294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x78c743326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 75] Process group watchdog thread terminated with exception: [Rank 75] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600082 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78c73e976446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x78c6f3c2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x78c6f3c31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x78c6f3c3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x78c73f0585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x78c743294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x78c743326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x78c73e976446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x78c6f38a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x78c73f0585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x78c743294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x78c743326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank79]:[E218 15:08:24.930628164 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 79] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank79]:[E218 15:08:24.930646305 ProcessGroupNCCL.cpp:630] [Rank 79] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. 
[rank79]:[E218 15:08:24.930651086 ProcessGroupNCCL.cpp:636] [Rank 79] To avoid data inconsistency, we are taking the entire process down. [rank79]:[E218 15:08:24.932566107 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 79] Process group watchdog thread terminated with exception: [Rank 79] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600031 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x700f66976446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x700f1bc2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x700f1bc31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x700f1bc3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x700f670585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x700f6b294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x700f6b326850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 79] Process group watchdog thread terminated with exception: [Rank 79] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600031 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x700f66976446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x700f1bc2a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x700f1bc31bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x700f1bc3361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x700f670585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x700f6b294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x700f6b326850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x700f66976446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x700f1b8a071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x700f670585c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x700f6b294ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x700f6b326850 in /lib/x86_64-linux-gnu/libc.so.6) [rank73]:[E218 15:08:24.933367438 ProcessGroupNCCL.cpp:1834] [PG ID 1 PG GUID 1 Rank 73] Timeout at NCCL work: 159205, last enqueued NCCL work: 159208, last completed NCCL work: 159204. [rank73]:[E218 15:08:24.933388660 ProcessGroupNCCL.cpp:630] [Rank 73] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [rank73]:[E218 15:08:24.933395700 ProcessGroupNCCL.cpp:636] [Rank 73] To avoid data inconsistency, we are taking the entire process down. [rank73]:[E218 15:08:24.935228246 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 73] Process group watchdog thread terminated with exception: [Rank 73] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600033 milliseconds before timing out. 
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a8459ec1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a840f22a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a840f231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a840f23361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a845a01c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a845e894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a845e926850 in /lib/x86_64-linux-gnu/libc.so.6) terminate called after throwing an instance of 'c10::DistBackendError' what(): [PG ID 1 PG GUID 1 Rank 73] Process group watchdog thread terminated with exception: [Rank 73] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600033 milliseconds before timing out. Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a8459ec1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional > >) + 0x282 (0x7a840f22a772 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7a840f231bb3 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7a840f23361d in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #4: + 0x145c0 (0x7a845a01c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #5: + 0x94ac3 (0x7a845e894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #6: + 0x126850 (0x7a845e926850 in /lib/x86_64-linux-gnu/libc.so.6) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7a8459ec1446 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libc10.so) frame #1: + 0xe4271b (0x7a840eea071b in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) frame #2: + 0x145c0 (0x7a845a01c5c0 in /home/zhaojiang/.local/lib/python3.10/site-packages/torch/lib/libtorch.so) frame #3: + 0x94ac3 (0x7a845e894ac3 in /lib/x86_64-linux-gnu/libc.so.6) frame #4: + 0x126850 (0x7a845e926850 in /lib/x86_64-linux-gnu/libc.so.6) [rank77]:[E218 15:08:24.974433785 ProcessGroupNCCL.cpp:1595] [PG ID 1 PG GUID 1 Rank 77] Process group watchdog thread terminated with exception: [Rank 77] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=159205, OpType=ALLREDUCE, NumelIn=495229180, NumelOut=495229180, Timeout(ms)=600000) ran for 600053 milliseconds before timing out. 
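Every rank above is blocked on the same all-reduce (SeqNum=159205, ~495M elements) and hits the 600000 ms collective timeout, after which the NCCL watchdog aborts each process with c10::DistBackendError. If the stall comes from a slow rank (checkpointing, a dataloader hiccup, one straggling node) rather than a genuine hang, one mitigation is to raise the process-group timeout at initialization. A minimal sketch, assuming the training script calls init_process_group itself; the 30-minute value is only illustrative:

    from datetime import timedelta
    import torch.distributed as dist

    # The default NCCL process-group timeout is 10 minutes, which matches the
    # Timeout(ms)=600000 seen in the watchdog messages above.
    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=30))

A longer timeout only hides stragglers; if one rank has truly desynchronized, the job still hangs until the new limit expires, so exporting NCCL_DEBUG=INFO (and TORCH_DISTRIBUTED_DEBUG=DETAIL) in the job script before the next run is also worth doing to see which rank stops making progress.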
W0218 15:09:01.352000 48757 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 48885 closing signal SIGTERM
W0218 15:09:01.355000 69159 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 69720 closing signal SIGTERM
W0218 15:09:01.357000 91237 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 92014 closing signal SIGTERM
W0218 15:09:01.359000 1618782 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1619512 closing signal SIGTERM
W0218 15:09:01.351000 3100165 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3100296 closing signal SIGTERM
W0218 15:09:01.366000 91872 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 93963 closing signal SIGTERM
W0218 15:09:01.372000 26625 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 26763 closing signal SIGTERM
W0218 15:09:01.383000 26604 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 26743 closing signal SIGTERM
W0218 15:09:01.384000 3089289 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3090023 closing signal SIGTERM
W0218 15:09:01.387000 91580 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 91763 closing signal SIGTERM
W0218 15:09:01.405000 88854 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 89628 closing signal SIGTERM
W0218 15:09:01.407000 3156229 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3156385 closing signal SIGTERM
W0218 15:09:01.418000 88279 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 89045 closing signal SIGTERM
W0218 15:09:01.431000 88915 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 89680 closing signal SIGTERM
W0218 15:09:01.691000 26057 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 26858 closing signal SIGTERM
[W0218 15:09:31.362000 through 15:09:31.699000, api.py:916: none of these worker processes exited within the shutdown grace period; each of the agents above then reported "Unable to shutdown process <worker pid> via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL" for the same worker pids.]
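In the ChildFailedError summaries below, every error_file field is empty, so torchrun can only report "Signal 6 (SIGABRT) received by PID ...". That is expected here, because the workers were killed by the C++ watchdog abort rather than by a Python exception, but for Python-level failures the traceback can be captured by wrapping the training entrypoint with the elastic record decorator. A minimal sketch, not the actual llava/train/train_mem.py code; the deliberate RuntimeError only demonstrates what would be written to the error file:

    from torch.distributed.elastic.multiprocessing.errors import record

    @record  # writes any uncaught Python exception to TORCHELASTIC_ERROR_FILE so torchrun can surface it
    def main():
        # placeholder body: a real script would run its training loop here
        raise RuntimeError("example failure that would appear under error_file")

    if __name__ == "__main__":
        main()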
W0218 15:09:31.388000 26604 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 26743 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0218 15:09:31.389000 3089289 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 3090023 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0218 15:09:31.392000 91580 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 91763 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0218 15:09:31.409000 88854 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 89628 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0218 15:09:31.411000 3156229 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 3156385 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0218 15:09:31.423000 88279 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 89045 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0218 15:09:31.435000 88915 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 89680 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
W0218 15:09:31.699000 26057 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:916] Unable to shutdown process 26858 via Signals.SIGTERM, forcefully exiting via Signals.SIGKILL
E0218 15:09:56.958000 88915 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 89679) of binary: /usr/bin/python3.10
E0218 15:09:57.750000 48757 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 48884) of binary: /usr/bin/python3.10
E0218 15:09:57.764000 3100165 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 3100293) of binary: /usr/bin/python3.10
E0218 15:09:57.816000 88279 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 89039) of binary: /usr/bin/python3.10
E0218 15:09:57.955000 91872 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 1 (pid: 93964) of binary: /usr/bin/python3.10
E0218 15:09:57.982000 69159 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 69717) of binary: /usr/bin/python3.10
E0218 15:09:58.008000 88854 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 89627) of binary: /usr/bin/python3.10
E0218 15:09:58.063000 1618782 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 1619510) of binary: /usr/bin/python3.10

Each of these eight agents then printed the same torchrun traceback, followed by a per-rank failure report for its node:

Traceback (most recent call last):
  File "/home/zhaojiang/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
llava/train/train_mem.py FAILED
------------------------------------------------------

The eight per-rank reports are combined in the table below. Every listed failure occurred at 2025-02-18_15:09:01 with exitcode -6 (Signal 6, SIGABRT, received by the worker PID) and an empty error_file; all hosts are in the ar-ai-use2.hpcaas domain.

host                     root cause (rank / local_rank / pid)   other failed ranks (rank / local_rank / pid)
h100-st-p548xlarge-401   120 / 0 / 89679                        122-127 / 2-7 / 89681-89686
h100-st-p548xlarge-250   56  / 0 / 48884                        58-63 / 2-7 / 48886-48891
h100-st-p548xlarge-81    24  / 0 / 3100293                      25-26, 28-31 / 1-2, 4-7 / 3100294-3100295, 3100297-3100300
h100-st-p548xlarge-389   96  / 0 / 89039                        97-101, 103 / 1-5, 7 / 89040-89044, 89046
h100-st-p548xlarge-390   105 / 1 / 93964                        106-111 / 2-7 / 93965-93970
h100-st-p548xlarge-249   48  / 0 / 69717                        49-50, 52-55 / 1-2, 4-7 / 69718-69719, 69721-69724
h100-st-p548xlarge-400   112 / 0 / 89627                        114-119 / 2-7 / 89629-89634
h100-st-p548xlarge-78    8   / 0 / 1619510                      9, 11-15 / 1, 3-7 / 1619511, 1619513-1619517
======================================================
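A note on the exit codes above: torchrun reports a negative exitcode when a worker is killed by a signal, so exitcode -6 means signal 6, SIGABRT, which is consistent with the "terminate called after throwing an instance of 'c10::DistBackendError'" abort earlier in the log. A small illustrative snippet, not part of the training code, for decoding such values:

# decode_exitcode.py -- illustration only.
import signal

def describe_exitcode(code: int) -> str:
    # Negative values encode the terminating signal; non-negative values are ordinary exit statuses.
    if code < 0:
        return f"killed by signal {-code} ({signal.Signals(-code).name})"
    return f"exited with status {code}"

print(describe_exitcode(-6))   # killed by signal 6 (SIGABRT)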
srun: error: h100-st-p548xlarge-401: task 15: Exited with exit code 1
srun: Terminating StepId=336337.0
slurmstepd: error: *** STEP 336337.0 ON h100-st-p548xlarge-77 CANCELLED AT 2025-02-18T15:09:58 ***
W0218 15:09:58.156000 26057 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0218 15:09:58.157000 3156229 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0218 15:09:58.156000 3089289 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0218 15:09:58.157000 91580 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0218 15:09:58.157000 26604 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0218 15:09:58.157000 26625 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0218 15:09:58.160000 3089289 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3090023 closing signal SIGTERM
W0218 15:09:58.160000 26057 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 26858 closing signal SIGTERM
W0218 15:09:58.160000 3156229 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3156385 closing signal SIGTERM
W0218 15:09:58.160000 91580 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 91763 closing signal SIGTERM
W0218 15:09:58.160000 26604 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 26743 closing signal SIGTERM
W0218 15:09:58.160000 26625 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 26763 closing signal SIGTERM
W0218 15:09:58.157000 1807834 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
W0218 15:09:58.163000 1807834 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1807965 closing signal SIGTERM
W0218 15:09:58.164000 1807834 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1807966 closing signal SIGTERM
W0218 15:09:58.165000 1807834 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1807967 closing signal SIGTERM
W0218 15:09:58.167000 1807834 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1807968 closing signal SIGTERM
W0218 15:09:58.167000 1807834 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1807969 closing signal SIGTERM
W0218 15:09:58.168000 1807834 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1807970 closing signal SIGTERM
W0218 15:09:58.169000 1807834 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1807971 closing signal SIGTERM
W0218 15:09:58.170000 1807834 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1807972 closing signal SIGTERM
E0218 15:09:58.207000 91237 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: -6) local_rank: 0 (pid: 92008) of binary: /usr/bin/python3.10
W0218 15:09:58.209000 91237 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers
srun: error: h100-st-p548xlarge-250: task 7: Terminated
srun: error: h100-st-p548xlarge-400: task 14: Terminated
srun: error: h100-st-p548xlarge-81: task 3: Terminated
srun: error: h100-st-p548xlarge-390: task 13: Terminated
srun: error: h100-st-p548xlarge-389: task 12: Terminated
srun: error: h100-st-p548xlarge-78: task 1: Terminated
srun: error: h100-st-p548xlarge-249: task 6: Terminated

Traceback (most recent call last):
  File "/home/zhaojiang/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
    result = agent.run()
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 699, in run
    self._record_worker_events(result)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 732, in _record_worker_events
    record(self._construct_event(state, EventSource.WORKER, worker, raw_error))
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/events/__init__.py", line 73, in record
    _get_or_create_logger(destination).info(event.serialize())
  File "/usr/lib/python3.10/logging/__init__.py", line 1477, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python3.10/logging/__init__.py", line 1622, in _log
    record = self.makeRecord(self.name, level, fn, lno, msg, args,
  File "/usr/lib/python3.10/logging/__init__.py", line 1591, in makeRecord
    rv = _logRecordFactory(name, level, fn, lno, msg, args, exc_info, func,
  File "/usr/lib/python3.10/logging/__init__.py", line 318, in __init__
    self.module = os.path.splitext(self.filename)[0]
  File "/usr/lib/python3.10/posixpath.py", line 125, in splitext
    return genericpath._splitext(p, sep, None, extsep)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 91237 got signal: 15

Traceback (most recent call last):
  File "/home/zhaojiang/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
    result = agent.run()
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run
    result = self._invoke_run(role)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in _invoke_run
    run_result = self._monitor_workers(self._worker_group)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers
    result = self._pcontext.wait(0)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait
    return self._poll()
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll
    self.close()  # terminate all running procs
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close
    self._close(death_sig=death_sig, timeout=timeout)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 923, in _close
    handler.proc.wait()
  File "/usr/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/usr/lib/python3.10/subprocess.py", line 1959, in _wait
    (pid, sts) = self._try_wait(0)
  File "/usr/lib/python3.10/subprocess.py", line 1917, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 91580 got signal: 15
The agents with PIDs 26625, 3156229, 3089289, 26604 and 26057 aborted with the same traceback as agent 91580 above, each ending in its own SignalException:

torch.distributed.elastic.multiprocessing.api.SignalException: Process 26625 got signal: 15
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3156229 got signal: 15
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3089289 got signal: 15
srun: error: h100-st-p548xlarge-383: task 10: Exited with exit code 1
srun: error: h100-st-p548xlarge-382: task 9: Exited with exit code 1
srun: error: h100-st-p548xlarge-122: task 4: Exited with exit code 1
srun: error: h100-st-p548xlarge-123: task 5: Exited with exit code 1
srun: error: h100-st-p548xlarge-80: task 2: Exited with exit code 1
torch.distributed.elastic.multiprocessing.api.SignalException: Process 26604 got signal: 15
srun: error: h100-st-p548xlarge-381: task 8: Exited with exit code 1
torch.distributed.elastic.multiprocessing.api.SignalException: Process 26057 got signal: 15
srun: error: h100-st-p548xlarge-384: task 11: Exited with exit code 1

Traceback (most recent call last):
  File "/home/zhaojiang/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent
    result = agent.run()
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper
    result = f(*args, **kwargs)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run
    result = self._invoke_run(role)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 855, in _invoke_run
    time.sleep(monitor_interval)
  File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler
    raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval)
torch.distributed.elastic.multiprocessing.api.SignalException: Process 1807834 got signal: 15
srun: error: h100-st-p548xlarge-77: task 0: Exited with exit code 1
srun: Force Terminated StepId=336337.0
pretrain.sh: 82: python: not found