/var/spool/slurmd/job335261/slurm_script: line 12: activate: No such file or directory
W0216 02:17:25.573000 3738705 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.573000 3738705 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.573000 3738705 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.573000 3738705 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.573000 1089965 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.573000 1089965 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.573000 1089965 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.573000 1089965 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.573000 1941883 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.573000 1941883 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.573000 1941883 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.573000 1941883 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.573000 3085469 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.573000 3085469 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.573000 3085469 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.573000 3085469 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.616000 1214129 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.616000 1214129 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.616000 1214129 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.616000 1214129 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.635000 1221719 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.635000 1221719 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.635000 1221719 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.635000 1221719 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.648000 3013148 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.648000 3013148 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.648000 3013148 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.648000 3013148 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.728000 1943572 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.728000 1943572 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.728000 1943572 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.728000 1943572 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.791000 3493174 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.791000 3493174 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.791000 3493174 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.791000 3493174 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.794000 3031269 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.794000 3031269 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.794000 3031269 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.794000 3031269 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.798000 3084128 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.798000 3084128 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.798000 3084128 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.798000 3084128 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.798000 2407212 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.798000 2407212 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.798000 2407212 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.798000 2407212 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.803000 3531051 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.803000 3531051 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.803000 3531051 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.803000 3531051 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.878000 3521676 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:25.878000 3521676 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:25.878000 3521676 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:25.878000 3521676 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:26.268000 2944883 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:26.268000 2944883 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:26.268000 2944883 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:26.268000 2944883 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:27.305000 3345256 .local/lib/python3.10/site-packages/torch/distributed/run.py:793]
W0216 02:17:27.305000 3345256 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
W0216 02:17:27.305000 3345256 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0216 02:17:27.305000 3345256 .local/lib/python3.10/site-packages/torch/distributed/run.py:793] *****************************************
PyTorch: setting up devices
[rank16]: Traceback (most recent call last):
[rank16]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank16]:     train(attn_implementation="flash_attention_2")
[rank16]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank16]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank16]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank16]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank16]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank96]: Traceback (most recent call last):
[rank96]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank96]:     train(attn_implementation="flash_attention_2")
[rank96]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank96]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank96]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank96]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank96]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder',
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank80]: Traceback (most recent call last):
[rank80]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank80]:     train(attn_implementation="flash_attention_2")
[rank80]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank80]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank80]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank80]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank80]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank104]: Traceback (most recent call last):
[rank104]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank104]:     train(attn_implementation="flash_attention_2")
[rank104]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank104]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank104]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank104]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank104]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder',
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank56]: Traceback (most recent call last):
[rank56]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank56]:     train(attn_implementation="flash_attention_2")
[rank56]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank56]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank56]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank56]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank56]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank101]: Traceback (most recent call last):
[rank101]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank101]:     train(attn_implementation="flash_attention_2")
[rank101]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank101]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank101]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank101]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank101]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder',
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank120]: Traceback (most recent call last):
[rank120]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank120]:     train(attn_implementation="flash_attention_2")
[rank120]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank120]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank120]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank120]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank120]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank21]: Traceback (most recent call last):
[rank21]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank21]:     train(attn_implementation="flash_attention_2")
[rank21]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank21]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank21]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank21]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank21]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder',
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank19]: Traceback (most recent call last):
[rank19]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank19]:     train(attn_implementation="flash_attention_2")
[rank19]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank19]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank19]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank19]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank19]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank22]: Traceback (most recent call last):
[rank22]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank22]:     train(attn_implementation="flash_attention_2")
[rank22]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank22]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank22]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank22]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank22]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder',
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank18]: Traceback (most recent call last):
[rank18]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank18]:     train(attn_implementation="flash_attention_2")
[rank18]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank18]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank18]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank18]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank18]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank98]: Traceback (most recent call last):
[rank98]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank98]:     train(attn_implementation="flash_attention_2")
[rank98]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank98]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank98]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank98]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank98]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder',
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank99]: Traceback (most recent call last):
[rank99]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank99]:     train(attn_implementation="flash_attention_2")
[rank99]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank99]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank99]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank99]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank99]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank17]: Traceback (most recent call last):
[rank17]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank17]:     train(attn_implementation="flash_attention_2")
[rank17]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank17]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank17]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank17]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank17]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder',
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank103]: Traceback (most recent call last):
[rank103]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank103]:     train(attn_implementation="flash_attention_2")
[rank103]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank103]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank103]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank103]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank103]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank20]: Traceback (most recent call last):
[rank20]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank20]:     train(attn_implementation="flash_attention_2")
[rank20]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank20]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank20]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank20]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank20]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder',
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank23]: Traceback (most recent call last):
[rank23]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank23]:     train(attn_implementation="flash_attention_2")
[rank23]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank23]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank23]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank23]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank23]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank105]: Traceback (most recent call last):
[rank105]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank105]:     train(attn_implementation="flash_attention_2")
[rank105]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank105]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank105]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank105]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank105]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder',
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank108]: Traceback (most recent call last):
[rank108]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank108]:     train(attn_implementation="flash_attention_2")
[rank108]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank108]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank108]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank108]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank108]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b']
[rank100]: Traceback (most recent call last):
[rank100]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in <module>
[rank100]:     train(attn_implementation="flash_attention_2")
[rank100]:   File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train
[rank100]:     model_args, data_args, training_args = parser.parse_args_into_dataclasses()
[rank100]:   File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses
[rank100]:     raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}")
[rank100]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder',
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank87]: Traceback (most recent call last): [rank87]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank87]: train(attn_implementation="flash_attention_2") [rank87]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank87]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank87]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank87]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank87]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank82]: Traceback (most recent call last): [rank82]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank82]: train(attn_implementation="flash_attention_2") [rank82]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank82]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank82]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank82]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank82]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank97]: Traceback (most recent call last): [rank97]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank97]: train(attn_implementation="flash_attention_2") [rank97]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank97]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank97]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank97]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank97]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank84]: Traceback (most recent call last): [rank84]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank84]: train(attn_implementation="flash_attention_2") [rank84]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank84]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank84]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank84]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank84]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank102]: Traceback (most recent call last): [rank102]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank102]: train(attn_implementation="flash_attention_2") [rank102]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank102]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank102]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank102]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank102]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank111]: Traceback (most recent call last): [rank111]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank111]: train(attn_implementation="flash_attention_2") [rank111]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank111]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank111]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank111]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank111]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank85]: Traceback (most recent call last): [rank85]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank85]: train(attn_implementation="flash_attention_2") [rank85]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank85]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank85]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank85]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank85]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank86]: Traceback (most recent call last): [rank86]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank86]: train(attn_implementation="flash_attention_2") [rank86]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank86]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank86]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank86]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank86]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank110]: Traceback (most recent call last): [rank110]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank110]: train(attn_implementation="flash_attention_2") [rank110]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank110]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank110]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank110]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank110]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank109]: Traceback (most recent call last): [rank109]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank109]: train(attn_implementation="flash_attention_2") [rank109]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank109]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank109]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank109]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank109]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank107]: Traceback (most recent call last): [rank107]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank107]: train(attn_implementation="flash_attention_2") [rank107]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank107]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank107]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank107]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank107]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank106]: Traceback (most recent call last): [rank106]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank106]: train(attn_implementation="flash_attention_2") [rank106]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank106]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank106]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank106]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank106]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank81]: Traceback (most recent call last): [rank81]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank81]: train(attn_implementation="flash_attention_2") [rank81]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank81]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank81]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank81]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank81]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank83]: Traceback (most recent call last): [rank83]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank83]: train(attn_implementation="flash_attention_2") [rank83]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank83]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank83]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank83]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank83]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank59]: Traceback (most recent call last): [rank59]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank59]: train(attn_implementation="flash_attention_2") [rank59]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank59]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank59]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank59]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank59]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank61]: Traceback (most recent call last): [rank61]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank61]: train(attn_implementation="flash_attention_2") [rank61]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank61]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank61]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank61]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank61]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank72]: Traceback (most recent call last): [rank72]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank72]: train(attn_implementation="flash_attention_2") [rank72]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank72]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank72]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank72]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank72]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank48]: Traceback (most recent call last): [rank48]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank48]: train(attn_implementation="flash_attention_2") [rank48]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank48]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank48]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank48]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank48]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank63]: Traceback (most recent call last): [rank63]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank63]: train(attn_implementation="flash_attention_2") [rank63]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank63]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank63]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank63]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank63]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank57]: Traceback (most recent call last): [rank57]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank57]: train(attn_implementation="flash_attention_2") [rank57]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank57]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank57]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank57]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank57]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank58]: Traceback (most recent call last): [rank58]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank58]: train(attn_implementation="flash_attention_2") [rank58]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank58]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank58]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank58]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank58]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank123]: Traceback (most recent call last): [rank123]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank123]: train(attn_implementation="flash_attention_2") [rank123]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank123]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank123]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank123]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank123]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank124]: Traceback (most recent call last): [rank124]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank124]: train(attn_implementation="flash_attention_2") [rank124]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank124]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank124]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank124]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank124]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank122]: Traceback (most recent call last): [rank122]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank122]: train(attn_implementation="flash_attention_2") [rank122]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank122]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank122]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank122]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank122]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank24]: Traceback (most recent call last): [rank24]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank24]: train(attn_implementation="flash_attention_2") [rank24]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank24]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank24]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank24]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank24]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank8]: Traceback (most recent call last): [rank8]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank8]: train(attn_implementation="flash_attention_2") [rank8]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank8]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank8]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank8]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank8]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] 
[rank125]: Traceback (most recent call last): [rank125]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank125]: train(attn_implementation="flash_attention_2") [rank125]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank125]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank125]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank125]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank125]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank127]: Traceback (most recent call last): [rank127]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank127]: train(attn_implementation="flash_attention_2") [rank127]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank127]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank127]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank127]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank127]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank121]: Traceback (most recent call last): [rank121]: File 
"/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank121]: train(attn_implementation="flash_attention_2") [rank121]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank121]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank121]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank121]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank121]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank126]: Traceback (most recent call last): [rank126]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank126]: train(attn_implementation="flash_attention_2") [rank126]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank126]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank126]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank126]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank126]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank62]: Traceback (most recent call last): [rank62]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, 
in [rank62]: train(attn_implementation="flash_attention_2") [rank62]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank62]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank62]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank62]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank62]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank60]: Traceback (most recent call last): [rank60]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank60]: train(attn_implementation="flash_attention_2") [rank60]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank60]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank60]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank60]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank60]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank16]:[W216 02:18:10.688138270 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. 
On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank96]:[W216 02:18:10.373108359 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank112]: Traceback (most recent call last): [rank112]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank112]: train(attn_implementation="flash_attention_2") [rank112]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank112]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank112]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank112]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank112]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank40]: Traceback (most recent call last): [rank40]: File 
"/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank40]: train(attn_implementation="flash_attention_2") [rank40]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank40]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank40]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank40]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank40]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank80]:[W216 02:18:10.817341077 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. 
This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank74]: Traceback (most recent call last): [rank74]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank74]: train(attn_implementation="flash_attention_2") [rank74]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank74]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank74]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank74]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank74]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank77]: Traceback (most recent call last): [rank77]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank77]: train(attn_implementation="flash_attention_2") [rank77]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank77]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank77]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank77]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank77]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] 
[rank76]: Traceback (most recent call last): [rank76]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank76]: train(attn_implementation="flash_attention_2") [rank76]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank76]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank76]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank76]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank76]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank79]: Traceback (most recent call last): [rank79]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank79]: train(attn_implementation="flash_attention_2") [rank79]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank79]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank79]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank79]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank79]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank78]: Traceback (most recent call last): [rank78]: File 
"/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank78]: train(attn_implementation="flash_attention_2") [rank78]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank78]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank78]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank78]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank78]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank75]: Traceback (most recent call last): [rank75]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank75]: train(attn_implementation="flash_attention_2") [rank75]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank75]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank75]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank75]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank75]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank54]: Traceback (most recent call last): [rank54]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank54]: 
train(attn_implementation="flash_attention_2") [rank54]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank54]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank54]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank54]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank54]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank73]: Traceback (most recent call last): [rank73]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank73]: train(attn_implementation="flash_attention_2") [rank73]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank73]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank73]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank73]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank73]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank104]:[W216 02:18:10.668730882 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. 
In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank56]:[W216 02:18:10.535258050 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank53]: Traceback (most recent call last): [rank53]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank53]: train(attn_implementation="flash_attention_2") [rank53]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank53]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank53]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank53]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank53]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank50]: Traceback (most recent call last): [rank50]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank50]: train(attn_implementation="flash_attention_2") [rank50]: File 
"/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank50]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank50]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank50]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank50]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank88]: Traceback (most recent call last): [rank88]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank88]: train(attn_implementation="flash_attention_2") [rank88]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank88]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank88]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank88]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank88]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank52]: Traceback (most recent call last): [rank52]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank52]: train(attn_implementation="flash_attention_2") [rank52]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train 
[rank52]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank52]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank52]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank52]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank49]: Traceback (most recent call last): [rank49]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank49]: train(attn_implementation="flash_attention_2") [rank49]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank49]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank49]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank49]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank49]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank41]: Traceback (most recent call last): [rank41]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank41]: train(attn_implementation="flash_attention_2") [rank41]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank41]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank41]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank41]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank41]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank55]: Traceback (most recent call last): [rank55]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank55]: train(attn_implementation="flash_attention_2") [rank55]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank55]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank55]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank55]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank55]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank51]: Traceback (most recent call last): [rank51]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank51]: train(attn_implementation="flash_attention_2") [rank51]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank51]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank51]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in 
parse_args_into_dataclasses [rank51]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank51]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank42]: Traceback (most recent call last): [rank42]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank42]: train(attn_implementation="flash_attention_2") [rank42]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank42]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank42]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank42]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank42]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank47]: Traceback (most recent call last): [rank47]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank47]: train(attn_implementation="flash_attention_2") [rank47]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank47]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank47]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank47]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: 
{remaining_args}") [rank47]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank46]: Traceback (most recent call last): [rank46]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank46]: train(attn_implementation="flash_attention_2") [rank46]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank46]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank46]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank46]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank46]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank118]: Traceback (most recent call last): [rank118]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank118]: train(attn_implementation="flash_attention_2") [rank118]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank118]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank118]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank118]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank118]: ValueError: Some specified arguments are not used by the HfArgumentParser: 
['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank120]:[W216 02:18:10.607501450 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank113]: Traceback (most recent call last): [rank113]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank113]: train(attn_implementation="flash_attention_2") [rank113]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank113]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank113]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank113]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank113]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank92]: Traceback (most recent call last): [rank92]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank92]: train(attn_implementation="flash_attention_2") [rank92]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train 
[rank92]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank92]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank92]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank92]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank91]: Traceback (most recent call last): [rank91]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank91]: train(attn_implementation="flash_attention_2") [rank91]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank91]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank91]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank91]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank91]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank89]: Traceback (most recent call last): [rank89]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank89]: train(attn_implementation="flash_attention_2") [rank89]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank89]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank89]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank89]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank89]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank44]: Traceback (most recent call last): [rank44]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank44]: train(attn_implementation="flash_attention_2") [rank44]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank44]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank44]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank44]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank44]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank43]: Traceback (most recent call last): [rank43]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank43]: train(attn_implementation="flash_attention_2") [rank43]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank43]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank43]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in 
parse_args_into_dataclasses [rank43]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank43]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank45]: Traceback (most recent call last): [rank45]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank45]: train(attn_implementation="flash_attention_2") [rank45]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank45]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank45]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank45]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank45]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank95]: Traceback (most recent call last): [rank95]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank95]: train(attn_implementation="flash_attention_2") [rank95]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank95]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank95]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank95]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: 
{remaining_args}") [rank95]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank115]: Traceback (most recent call last): [rank115]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank115]: train(attn_implementation="flash_attention_2") [rank115]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank115]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank115]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank115]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank115]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank93]: Traceback (most recent call last): [rank93]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank93]: train(attn_implementation="flash_attention_2") [rank93]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank93]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank93]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank93]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank93]: ValueError: Some specified arguments are not used by the HfArgumentParser: 
['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank116]: Traceback (most recent call last): [rank116]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank116]: train(attn_implementation="flash_attention_2") [rank116]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank116]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank116]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank116]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank116]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank117]: Traceback (most recent call last): [rank117]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank117]: train(attn_implementation="flash_attention_2") [rank117]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank117]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank117]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank117]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank117]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] PyTorch: setting up devices PyTorch: setting up devices PyTorch: setting up devices PyTorch: setting up devices PyTorch: setting up devices PyTorch: setting up devices PyTorch: setting up devices PyTorch: setting up devices [rank90]: Traceback (most recent call last): [rank90]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank90]: train(attn_implementation="flash_attention_2") [rank90]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank90]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank90]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank90]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank90]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank94]: Traceback (most recent call last): [rank94]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank94]: train(attn_implementation="flash_attention_2") [rank94]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank94]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank94]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank94]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank94]: 
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank114]: Traceback (most recent call last): [rank114]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank114]: train(attn_implementation="flash_attention_2") [rank114]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank114]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank114]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank114]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank114]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank119]: Traceback (most recent call last): [rank119]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank119]: train(attn_implementation="flash_attention_2") [rank119]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank119]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank119]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank119]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank119]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank32]: Traceback (most recent call last): [rank32]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank32]: train(attn_implementation="flash_attention_2") [rank32]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank32]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank32]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank32]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank32]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank13]: Traceback (most recent call last): [rank13]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank13]: train(attn_implementation="flash_attention_2") [rank13]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank13]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank13]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank13]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank13]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank14]: Traceback (most recent call last): [rank14]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank14]: train(attn_implementation="flash_attention_2") [rank14]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank14]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank14]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank14]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank14]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank64]: Traceback (most recent call last): [rank64]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank64]: train(attn_implementation="flash_attention_2") [rank64]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank64]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank64]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank64]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank64]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank10]: Traceback (most recent call last): [rank10]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank10]: train(attn_implementation="flash_attention_2") [rank10]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank10]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank10]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank10]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank10]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank38]: Traceback (most recent call last): [rank38]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank38]: train(attn_implementation="flash_attention_2") [rank38]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank38]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank38]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank38]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank38]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank36]: Traceback (most recent call last): [rank36]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank36]: train(attn_implementation="flash_attention_2") [rank36]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank36]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank36]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank36]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank36]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank9]: Traceback (most recent call last): [rank9]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank9]: train(attn_implementation="flash_attention_2") [rank9]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank9]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank9]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank9]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank9]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] 
[rank72]:[W216 02:18:10.762034624 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank15]: Traceback (most recent call last): [rank15]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank15]: train(attn_implementation="flash_attention_2") [rank15]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank15]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank15]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank15]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank15]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank12]: Traceback (most recent call last): [rank12]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank12]: train(attn_implementation="flash_attention_2") [rank12]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank12]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank12]: File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank12]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank12]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank11]: Traceback (most recent call last): [rank11]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank11]: train(attn_implementation="flash_attention_2") [rank11]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank11]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank11]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank11]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank11]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank37]: Traceback (most recent call last): [rank37]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank37]: train(attn_implementation="flash_attention_2") [rank37]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank37]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank37]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in 
parse_args_into_dataclasses [rank37]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank37]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank34]: Traceback (most recent call last): [rank34]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank34]: train(attn_implementation="flash_attention_2") [rank34]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank34]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank34]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank34]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank34]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank29]: Traceback (most recent call last): [rank29]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank29]: train(attn_implementation="flash_attention_2") [rank29]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank29]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank29]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank29]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: 
{remaining_args}") [rank29]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank35]: Traceback (most recent call last): [rank35]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank35]: train(attn_implementation="flash_attention_2") [rank35]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank35]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank35]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank35]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank35]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank30]: Traceback (most recent call last): [rank30]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank30]: train(attn_implementation="flash_attention_2") [rank30]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank30]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank30]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank30]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank30]: ValueError: Some specified arguments are not used by the HfArgumentParser: 
['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank33]: Traceback (most recent call last): [rank33]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank33]: train(attn_implementation="flash_attention_2") [rank33]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank33]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank33]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank33]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank33]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank26]: Traceback (most recent call last): [rank26]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank26]: train(attn_implementation="flash_attention_2") [rank26]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank26]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank26]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank26]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank26]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank27]: Traceback (most recent call last): [rank27]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank27]: train(attn_implementation="flash_attention_2") [rank27]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank27]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank27]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank27]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank27]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank31]: Traceback (most recent call last): [rank31]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank31]: train(attn_implementation="flash_attention_2") [rank31]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank31]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank31]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank31]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank31]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank39]: Traceback (most recent call last): [rank39]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank39]: train(attn_implementation="flash_attention_2") [rank39]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank39]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank39]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank39]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank39]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank28]: Traceback (most recent call last): [rank28]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank28]: train(attn_implementation="flash_attention_2") [rank28]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank28]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank28]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank28]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank28]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank48]:[W216 02:18:10.989800669 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank25]: Traceback (most recent call last): [rank25]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank25]: train(attn_implementation="flash_attention_2") [rank25]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank25]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank25]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank25]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank25]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank112]:[W216 02:18:10.645294270 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. 
In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank40]:[W216 02:18:10.366061795 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank66]: Traceback (most recent call last): [rank66]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank66]: train(attn_implementation="flash_attention_2") [rank66]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank66]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank66]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank66]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank66]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank69]: Traceback (most recent call last): [rank69]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank69]: train(attn_implementation="flash_attention_2") [rank69]: File 
"/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank69]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank69]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank69]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank69]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank70]: Traceback (most recent call last): [rank70]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank70]: train(attn_implementation="flash_attention_2") [rank70]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank70]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank70]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank70]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank70]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank24]:[W216 02:18:11.633587079 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. 
In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank65]: Traceback (most recent call last): [rank65]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank65]: train(attn_implementation="flash_attention_2") [rank65]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank65]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank65]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank65]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank65]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank8]:[W216 02:18:11.929085758 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. 
This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank71]: Traceback (most recent call last): [rank71]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank71]: train(attn_implementation="flash_attention_2") [rank71]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank71]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank71]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank71]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank71]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank68]: Traceback (most recent call last): [rank68]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank68]: train(attn_implementation="flash_attention_2") [rank68]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank68]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank68]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank68]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank68]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] 
[rank67]: Traceback (most recent call last): [rank67]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank67]: train(attn_implementation="flash_attention_2") [rank67]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank67]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank67]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank67]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank67]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank88]:[W216 02:18:11.795214086 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank32]:[W216 02:18:11.074655039 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. 
This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) [rank0]: Traceback (most recent call last): [rank0]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank0]: train(attn_implementation="flash_attention_2") [rank0]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank0]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank0]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank0]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank0]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank64]:[W216 02:18:11.347122078 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. 
This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) W0216 02:18:11.930000 3085469 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3085601 closing signal SIGTERM W0216 02:18:11.931000 3085469 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3085602 closing signal SIGTERM W0216 02:18:11.932000 3085469 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3085603 closing signal SIGTERM W0216 02:18:11.932000 3085469 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3085604 closing signal SIGTERM W0216 02:18:11.932000 3085469 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3085605 closing signal SIGTERM W0216 02:18:11.932000 3085469 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3085606 closing signal SIGTERM W0216 02:18:11.932000 1089965 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1090134 closing signal SIGTERM W0216 02:18:11.933000 3085469 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3085607 closing signal SIGTERM W0216 02:18:11.933000 1089965 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1090135 closing signal SIGTERM W0216 02:18:11.933000 1089965 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1090136 closing signal SIGTERM W0216 02:18:11.934000 1089965 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1090137 closing signal SIGTERM W0216 02:18:11.935000 1089965 
.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1090138 closing signal SIGTERM W0216 02:18:11.935000 1089965 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1090139 closing signal SIGTERM W0216 02:18:11.935000 1089965 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1090140 closing signal SIGTERM W0216 02:18:12.030000 1941883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1942053 closing signal SIGTERM W0216 02:18:12.030000 1941883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1942054 closing signal SIGTERM W0216 02:18:12.031000 1941883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1942055 closing signal SIGTERM W0216 02:18:12.031000 1941883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1942056 closing signal SIGTERM W0216 02:18:12.031000 1941883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1942057 closing signal SIGTERM W0216 02:18:12.031000 1941883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1942058 closing signal SIGTERM W0216 02:18:12.032000 1941883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1942059 closing signal SIGTERM W0216 02:18:12.032000 1221719 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1221847 closing signal SIGTERM W0216 02:18:12.032000 1221719 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1221848 closing signal SIGTERM W0216 02:18:12.033000 
1221719 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1221849 closing signal SIGTERM W0216 02:18:12.033000 1221719 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1221850 closing signal SIGTERM W0216 02:18:12.033000 1221719 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1221851 closing signal SIGTERM W0216 02:18:12.034000 1221719 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1221852 closing signal SIGTERM W0216 02:18:12.034000 1221719 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1221853 closing signal SIGTERM [rank0]:[W216 02:18:12.308290844 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. 
This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator()) W0216 02:18:12.040000 3031269 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3031397 closing signal SIGTERM W0216 02:18:12.041000 3031269 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3031398 closing signal SIGTERM W0216 02:18:12.042000 3031269 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3031399 closing signal SIGTERM W0216 02:18:12.042000 3031269 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3031400 closing signal SIGTERM W0216 02:18:12.043000 3031269 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3031401 closing signal SIGTERM W0216 02:18:12.043000 3031269 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3031402 closing signal SIGTERM W0216 02:18:12.044000 3031269 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3031403 closing signal SIGTERM [rank3]: Traceback (most recent call last): [rank3]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank3]: train(attn_implementation="flash_attention_2") [rank3]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank3]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank3]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank3]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank3]: 
ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank6]: Traceback (most recent call last): [rank6]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank6]: train(attn_implementation="flash_attention_2") [rank6]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank6]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank6]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank6]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank6]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank2]: Traceback (most recent call last): [rank2]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank2]: train(attn_implementation="flash_attention_2") [rank2]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank2]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank2]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank2]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank2]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', 
'/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank7]: Traceback (most recent call last): [rank7]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank7]: train(attn_implementation="flash_attention_2") [rank7]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank7]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank7]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank7]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank7]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank1]: Traceback (most recent call last): [rank1]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank1]: train(attn_implementation="flash_attention_2") [rank1]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank1]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank1]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank1]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank1]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank5]: 
Traceback (most recent call last): [rank5]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank5]: train(attn_implementation="flash_attention_2") [rank5]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank5]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank5]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank5]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank5]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] [rank4]: Traceback (most recent call last): [rank4]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train_mem.py", line 4, in [rank4]: train(attn_implementation="flash_attention_2") [rank4]: File "/opt/hpcaas/.mounts/fs-036153e63d56f4dc2/home/zhaojiang/interleaved-llava/llava/train/train.py", line 1423, in train [rank4]: model_args, data_args, training_args = parser.parse_args_into_dataclasses() [rank4]: File "/home/zhaojiang/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 366, in parse_args_into_dataclasses [rank4]: raise ValueError(f"Some specified arguments are not used by the HfArgumentParser: {remaining_args}") [rank4]: ValueError: Some specified arguments are not used by the HfArgumentParser: ['--datacomp_image_folder', '/fsx_0/user/zhaojiang/models/hub/datasets--umd-vt-nyu--datacomp_1B_21M/snapshots/a199d89dd1a6f60b68a2bddc7fb289bb8600c57b'] W0216 02:18:12.533000 3521676 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3522446 
closing signal SIGTERM W0216 02:18:12.534000 3521676 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3522447 closing signal SIGTERM W0216 02:18:12.534000 3521676 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3522448 closing signal SIGTERM W0216 02:18:12.534000 3345256 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3345420 closing signal SIGTERM W0216 02:18:12.534000 3521676 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3522449 closing signal SIGTERM W0216 02:18:12.534000 3345256 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3345421 closing signal SIGTERM W0216 02:18:12.535000 3521676 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3522450 closing signal SIGTERM W0216 02:18:12.535000 3345256 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3345422 closing signal SIGTERM W0216 02:18:12.535000 3521676 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3522451 closing signal SIGTERM W0216 02:18:12.535000 3345256 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3345423 closing signal SIGTERM W0216 02:18:12.534000 1943572 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1943695 closing signal SIGTERM W0216 02:18:12.535000 3521676 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3522452 closing signal SIGTERM W0216 02:18:12.535000 3345256 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 
3345424 closing signal SIGTERM W0216 02:18:12.535000 1943572 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1943696 closing signal SIGTERM W0216 02:18:12.535000 3345256 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3345425 closing signal SIGTERM W0216 02:18:12.536000 3345256 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3345426 closing signal SIGTERM W0216 02:18:12.536000 1943572 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1943697 closing signal SIGTERM W0216 02:18:12.536000 2944883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2945208 closing signal SIGTERM W0216 02:18:12.536000 1943572 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1943698 closing signal SIGTERM W0216 02:18:12.536000 1943572 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1943699 closing signal SIGTERM W0216 02:18:12.536000 2944883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2945209 closing signal SIGTERM W0216 02:18:12.537000 2944883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2945210 closing signal SIGTERM W0216 02:18:12.537000 1943572 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1943700 closing signal SIGTERM W0216 02:18:12.537000 2944883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2945211 closing signal SIGTERM W0216 02:18:12.537000 1943572 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending 
process 1943701 closing signal SIGTERM W0216 02:18:12.538000 2944883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2945212 closing signal SIGTERM W0216 02:18:12.538000 2944883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2945213 closing signal SIGTERM W0216 02:18:12.636000 1214129 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1214255 closing signal SIGTERM W0216 02:18:12.637000 1214129 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1214256 closing signal SIGTERM W0216 02:18:12.637000 1214129 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1214257 closing signal SIGTERM W0216 02:18:12.637000 1214129 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1214258 closing signal SIGTERM W0216 02:18:12.638000 1214129 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1214259 closing signal SIGTERM W0216 02:18:12.638000 1214129 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1214260 closing signal SIGTERM W0216 02:18:12.638000 1214129 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1214261 closing signal SIGTERM W0216 02:18:12.736000 3084128 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3084300 closing signal SIGTERM W0216 02:18:12.736000 3084128 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3084301 closing signal SIGTERM W0216 02:18:12.737000 3084128 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] 
Sending process 3084302 closing signal SIGTERM W0216 02:18:12.738000 3084128 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3084303 closing signal SIGTERM W0216 02:18:12.738000 3084128 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3084304 closing signal SIGTERM W0216 02:18:12.738000 3084128 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3084305 closing signal SIGTERM W0216 02:18:12.738000 3084128 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3084306 closing signal SIGTERM W0216 02:18:12.835000 2407212 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2407401 closing signal SIGTERM W0216 02:18:12.836000 2407212 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2407402 closing signal SIGTERM W0216 02:18:12.837000 2407212 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2407403 closing signal SIGTERM W0216 02:18:12.837000 2407212 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2407404 closing signal SIGTERM W0216 02:18:12.837000 2407212 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2407405 closing signal SIGTERM W0216 02:18:12.838000 2407212 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2407406 closing signal SIGTERM W0216 02:18:12.838000 2407212 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2407407 closing signal SIGTERM W0216 02:18:12.937000 3493174 
.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3493300 closing signal SIGTERM W0216 02:18:12.937000 3493174 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3493301 closing signal SIGTERM W0216 02:18:12.938000 3493174 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3493302 closing signal SIGTERM W0216 02:18:12.939000 3493174 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3493303 closing signal SIGTERM W0216 02:18:12.939000 3493174 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3493304 closing signal SIGTERM W0216 02:18:12.940000 3493174 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3493305 closing signal SIGTERM W0216 02:18:12.940000 3493174 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3493306 closing signal SIGTERM W0216 02:18:13.237000 3738705 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3738835 closing signal SIGTERM W0216 02:18:13.238000 3738705 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3738836 closing signal SIGTERM W0216 02:18:13.238000 3738705 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3738837 closing signal SIGTERM W0216 02:18:13.238000 3738705 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3738838 closing signal SIGTERM W0216 02:18:13.238000 3738705 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3738839 closing signal SIGTERM W0216 02:18:13.239000 
3738705 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3738840 closing signal SIGTERM E0216 02:18:13.649000 1941883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1942052) of binary: /usr/bin/python3.10 Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ llava/train/train_mem.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-02-16_02:18:12 host : h100-st-p548xlarge-166.ar-ai-use2.hpcaas rank : 56 (local_rank: 0) exitcode : 1 (pid: 1942052) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ E0216 02:18:13.715000 1089965 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1090133) of binary: 
/usr/bin/python3.10 E0216 02:18:13.727000 3085469 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 3085600) of binary: /usr/bin/python3.10 W0216 02:18:13.737000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3531224 closing signal SIGTERM W0216 02:18:13.738000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3531225 closing signal SIGTERM W0216 02:18:13.738000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3531226 closing signal SIGTERM W0216 02:18:13.738000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3531227 closing signal SIGTERM W0216 02:18:13.739000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3531228 closing signal SIGTERM W0216 02:18:13.739000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3531229 closing signal SIGTERM W0216 02:18:13.739000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3531230 closing signal SIGTERM W0216 02:18:13.739000 3013148 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3013278 closing signal SIGTERM W0216 02:18:13.740000 3013148 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3013279 closing signal SIGTERM W0216 02:18:13.740000 3013148 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3013280 closing signal SIGTERM W0216 02:18:13.741000 3013148 
.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3013281 closing signal SIGTERM W0216 02:18:13.741000 3013148 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3013282 closing signal SIGTERM W0216 02:18:13.741000 3013148 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3013283 closing signal SIGTERM W0216 02:18:13.741000 3013148 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3013284 closing signal SIGTERM Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ llava/train/train_mem.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-02-16_02:18:11 host : h100-st-p548xlarge-232.ar-ai-use2.hpcaas rank : 96 (local_rank: 0) exitcode : 1 (pid: 1090133) error_file: traceback : To enable traceback see: 
https://pytorch.org/docs/stable/elastic/errors.html ============================================================ Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ llava/train/train_mem.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-02-16_02:18:11 host : h100-st-p548xlarge-74.ar-ai-use2.hpcaas rank : 16 (local_rank: 0) exitcode : 1 (pid: 3085600) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ E0216 02:18:13.764000 1221719 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1221846) of binary: /usr/bin/python3.10 E0216 02:18:13.788000 3031269 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 3031396) of binary: /usr/bin/python3.10 Traceback (most recent call 
last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ llava/train/train_mem.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-02-16_02:18:12 host : h100-st-p548xlarge-233.ar-ai-use2.hpcaas rank : 104 (local_rank: 0) exitcode : 1 (pid: 1221846) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ llava/train/train_mem.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2025-02-16_02:18:12 host : h100-st-p548xlarge-196.ar-ai-use2.hpcaas rank : 80 (local_rank: 0) exitcode : 1 (pid: 3031396) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ srun: error: h100-st-p548xlarge-166: task 7: Exited with exit code 1 srun: Terminating StepId=335261.0 slurmstepd: error: *** STEP 335261.0 ON h100-st-p548xlarge-51 CANCELLED AT 2025-02-16T02:18:13 *** W0216 02:18:13.993000 3521676 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers W0216 02:18:13.993000 1943572 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers W0216 02:18:13.993000 3521676 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3522451 closing signal SIGTERM W0216 02:18:13.993000 3013148 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers W0216 02:18:13.993000 3738705 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death 
signal, shutting down workers W0216 02:18:13.993000 2944883 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers W0216 02:18:13.993000 1943572 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1943696 closing signal SIGTERM W0216 02:18:13.993000 3345256 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers W0216 02:18:13.993000 2407212 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers W0216 02:18:13.993000 3084128 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers W0216 02:18:13.994000 3013148 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3013280 closing signal SIGTERM W0216 02:18:13.993000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers W0216 02:18:13.994000 3084128 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3084301 closing signal SIGTERM W0216 02:18:13.994000 3738705 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3738835 closing signal SIGTERM W0216 02:18:13.994000 2944883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2945208 closing signal SIGTERM W0216 02:18:13.994000 3345256 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3345421 closing signal SIGTERM W0216 02:18:13.994000 2407212 
.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2407403 closing signal SIGTERM W0216 02:18:13.994000 1214129 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers W0216 02:18:13.994000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3531224 closing signal SIGTERM W0216 02:18:13.994000 3013148 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3013281 closing signal SIGTERM W0216 02:18:13.994000 1943572 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1943699 closing signal SIGTERM W0216 02:18:13.994000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3531225 closing signal SIGTERM W0216 02:18:13.994000 1214129 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1214255 closing signal SIGTERM W0216 02:18:13.994000 3013148 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3013282 closing signal SIGTERM W0216 02:18:13.994000 3493174 .local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py:704] Received Signals.SIGTERM death signal, shutting down workers W0216 02:18:13.994000 3738705 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3738839 closing signal SIGTERM W0216 02:18:13.994000 2944883 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2945213 closing signal SIGTERM W0216 02:18:13.994000 1943572 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1943701 closing signal SIGTERM 
W0216 02:18:13.994000 3345256 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3345426 closing signal SIGTERM W0216 02:18:13.995000 3013148 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3013283 closing signal SIGTERM W0216 02:18:13.995000 3493174 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3493303 closing signal SIGTERM W0216 02:18:13.995000 3738705 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3738840 closing signal SIGTERM W0216 02:18:13.995000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3531228 closing signal SIGTERM W0216 02:18:13.995000 3013148 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3013284 closing signal SIGTERM W0216 02:18:13.995000 1214129 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1214260 closing signal SIGTERM W0216 02:18:13.995000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3531229 closing signal SIGTERM W0216 02:18:13.995000 1214129 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1214261 closing signal SIGTERM W0216 02:18:13.995000 3531051 .local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 3531230 closing signal SIGTERM srun: error: h100-st-p548xlarge-74: task 2: Exited with exit code 1 srun: error: h100-st-p548xlarge-232: task 12: Exited with exit code 1 srun: error: h100-st-p548xlarge-233: task 13: Terminated srun: error: h100-st-p548xlarge-196: task 10: Terminated Traceback (most recent call last): File 
"/home/zhaojiang/.local/bin/torchrun", line 8, in Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent result = agent.run() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = agent.run() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 
696, in run result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run result = self._invoke_run(role) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in _invoke_run run_result = self._monitor_workers(self._worker_group) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = self._invoke_run(role) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in _invoke_run result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers run_result = self._monitor_workers(self._worker_group) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers result = self._pcontext.wait(0) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait result = self._pcontext.wait(0) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait return self._poll() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll self.close() # terminate all running procs File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close return self._poll() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll 
self._close(death_sig=death_sig, timeout=timeout) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 909, in _close self.close() # terminate all running procs File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close handler.proc.wait(time_to_wait) File "/usr/lib/python3.10/subprocess.py", line 1209, in wait self._close(death_sig=death_sig, timeout=timeout) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 909, in _close return self._wait(timeout=timeout) File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait handler.proc.wait(time_to_wait) File "/usr/lib/python3.10/subprocess.py", line 1209, in wait time.sleep(delay) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler return self._wait(timeout=timeout) File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 1943572 got signal: 15 time.sleep(delay) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 2944883 got signal: 15 Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent result = agent.run() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent result = agent.run() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run result = self._invoke_run(role) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in 
_invoke_run run_result = self._monitor_workers(self._worker_group) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = self._invoke_run(role) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in _invoke_run result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers run_result = self._monitor_workers(self._worker_group) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers result = self._pcontext.wait(0) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait result = self._pcontext.wait(0) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait return self._poll() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll self.close() # terminate all running procs File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close self._close(death_sig=death_sig, timeout=timeout) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 909, in _close return self._poll() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll self.close() # terminate all running procs File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close handler.proc.wait(time_to_wait) File "/usr/lib/python3.10/subprocess.py", line 1209, in wait return self._wait(timeout=timeout) File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait self._close(death_sig=death_sig, timeout=timeout) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 909, in _close handler.proc.wait(time_to_wait) File "/usr/lib/python3.10/subprocess.py", line 1209, in wait time.sleep(delay) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 1214129 got signal: 15 return self._wait(timeout=timeout) File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait time.sleep(delay) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 3531051 got signal: 15 Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return 
launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent result = agent.run() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run result = self._invoke_run(role) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in _invoke_run run_result = self._monitor_workers(self._worker_group) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers result = self._pcontext.wait(0) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait return self._poll() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll self.close() # terminate all running procs File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close self._close(death_sig=death_sig, timeout=timeout) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 909, in _close handler.proc.wait(time_to_wait) File "/usr/lib/python3.10/subprocess.py", line 1209, in wait return self._wait(timeout=timeout) File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait time.sleep(delay) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in 
_terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 3345256 got signal: 15 srun: error: h100-st-p548xlarge-165: task 6: Exited with exit code 1 srun: error: h100-st-p548xlarge-239: task 15: Exited with exit code 1 Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent result = agent.run() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run result = self._invoke_run(role) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in _invoke_run run_result = self._monitor_workers(self._worker_group) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers result = self._pcontext.wait(0) 
File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait return self._poll() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll self.close() # terminate all running procs File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close self._close(death_sig=death_sig, timeout=timeout) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 909, in _close handler.proc.wait(time_to_wait) File "/usr/lib/python3.10/subprocess.py", line 1209, in wait return self._wait(timeout=timeout) File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait time.sleep(delay) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 3084128 got signal: 15 Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent result = agent.run() File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run result = self._invoke_run(role) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in _invoke_run run_result = self._monitor_workers(self._worker_group) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers result = self._pcontext.wait(0) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait srun: error: h100-st-p548xlarge-169: task 8: Exited with exit code 1 return self._poll() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll self.close() # terminate all running procs File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close self._close(death_sig=death_sig, timeout=timeout) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 909, in _close handler.proc.wait(time_to_wait) File "/usr/lib/python3.10/subprocess.py", line 1209, in wait return self._wait(timeout=timeout) File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait time.sleep(delay) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) 
torch.distributed.elastic.multiprocessing.api.SignalException: Process 3493174 got signal: 15 srun: error: h100-st-p548xlarge-238: task 14: Exited with exit code 1 Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent result = agent.run() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run result = self._invoke_run(role) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in _invoke_run run_result = self._monitor_workers(self._worker_group) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers result = self._pcontext.wait(0) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait return self._poll() File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll self.close() # terminate all running procs File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close self._close(death_sig=death_sig, timeout=timeout) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 909, in _close handler.proc.wait(time_to_wait) File "/usr/lib/python3.10/subprocess.py", line 1209, in wait return self._wait(timeout=timeout) File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait time.sleep(delay) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 2407212 got signal: 15 srun: error: h100-st-p548xlarge-113: task 5: Exited with exit code 1 Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent result = agent.run() File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run result = self._invoke_run(role) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in _invoke_run run_result = self._monitor_workers(self._worker_group) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers result = self._pcontext.wait(0) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait return self._poll() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll self.close() # terminate all running procs File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close self._close(death_sig=death_sig, timeout=timeout) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 909, in _close handler.proc.wait(time_to_wait) File "/usr/lib/python3.10/subprocess.py", line 1209, in wait return self._wait(timeout=timeout) File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait time.sleep(delay) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 3738705 got signal: 15 srun: error: 
h100-st-p548xlarge-75: task 3: Exited with exit code 1 srun: error: h100-st-p548xlarge-197: task 11: Exited with exit code 1 srun: error: h100-st-p548xlarge-52: task 1: Exited with exit code 1 srun: error: h100-st-p548xlarge-112: task 4: Exited with exit code 1 Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent result = agent.run() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run result = self._invoke_run(role) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in _invoke_run run_result = self._monitor_workers(self._worker_group) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers result = self._pcontext.wait(0) File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait return self._poll() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll self.close() # terminate all running procs File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close self._close(death_sig=death_sig, timeout=timeout) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 909, in _close handler.proc.wait(time_to_wait) File "/usr/lib/python3.10/subprocess.py", line 1209, in wait return self._wait(timeout=timeout) File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait time.sleep(delay) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 3521676 got signal: 15 Traceback (most recent call last): File "/home/zhaojiang/.local/bin/torchrun", line 8, in sys.exit(main()) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper return f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main run(args) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run elastic_launch( File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 260, in launch_agent result = agent.run() File 
"/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 696, in run result = self._invoke_run(role) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 856, in _invoke_run run_result = self._monitor_workers(self._worker_group) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 137, in wrapper result = f(*args, **kwargs) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 387, in _monitor_workers result = self._pcontext.wait(0) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 531, in wait return self._poll() File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 861, in _poll self.close() # terminate all running procs File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 572, in close self._close(death_sig=death_sig, timeout=timeout) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 909, in _close handler.proc.wait(time_to_wait) File "/usr/lib/python3.10/subprocess.py", line 1209, in wait return self._wait(timeout=timeout) File "/usr/lib/python3.10/subprocess.py", line 1953, in _wait time.sleep(delay) File "/home/zhaojiang/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 84, in _terminate_process_handler raise SignalException(f"Process {os.getpid()} got signal: {sigval}", sigval=sigval) torch.distributed.elastic.multiprocessing.api.SignalException: Process 3013148 got signal: 15 srun: error: 
h100-st-p548xlarge-170: task 9: Exited with exit code 1 srun: error: h100-st-p548xlarge-51: task 0: Exited with exit code 1 srun: Force Terminated StepId=335261.0