OSError: google/flan-t5-xl does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
#19, opened by anirudhmittal
I am getting this error while loading the model. It seems that my current version of transformers cannot handle sharded checkpoints. Is there a workaround other than upgrading transformers?
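One possible workaround, sketched here under assumptions rather than as an official recipe: a sharded checkpoint ships an index file (`pytorch_model.bin.index.json`) whose `weight_map` maps each parameter name to a shard such as `pytorch_model-00001-of-00002.bin`. If the shards are already downloaded, they could be merged into a single `pytorch_model.bin` that older transformers releases look for. The helper names below are hypothetical, and the merge step assumes torch is installed and that there is enough free RAM to hold all weights at once.

```python
import json
from pathlib import Path


def shard_files_from_index(index: dict) -> list[str]:
    """Return the distinct shard filenames listed in a checkpoint index, sorted."""
    return sorted(set(index["weight_map"].values()))


def merge_shards(checkpoint_dir: str, out_name: str = "pytorch_model.bin") -> Path:
    """Merge every shard in checkpoint_dir into one state-dict file.

    Hypothetical helper: it holds the whole model in RAM while merging.
    """
    import torch  # imported lazily so the helper above works without torch

    folder = Path(checkpoint_dir)
    index = json.loads((folder / "pytorch_model.bin.index.json").read_text())
    merged = {}
    for name in shard_files_from_index(index):
        # Each shard is an ordinary state dict covering part of the parameters.
        merged.update(torch.load(folder / name, map_location="cpu"))
    out_path = folder / out_name
    torch.save(merged, out_path)
    return out_path
```

Pointing `from_pretrained` at a local folder that contains the merged file plus the original config and tokenizer files should then avoid the sharded code path, though this has not been verified against every transformers version.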
I am having a similar problem.
Here is the log:
Loading checkpoint shards: 0%| | 0/2 [00:05<?, ?it/s]
Traceback (most recent call last):
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/transformers/modeling_utils.py", line 460, in load_state_dict
return torch.load(checkpoint_file, map_location="cpu")
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/clearml/binding/frameworks/__init__.py", line 36, in _inner_patch
raise ex
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/clearml/binding/frameworks/__init__.py", line 34, in _inner_patch
ret = patched_fn(original_fn, *args, **kwargs)
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/clearml/binding/frameworks/pytorch_bind.py", line 279, in _load
model = original_fn(f, *args, **kwargs)
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/serialization.py", line 809, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/clearml/binding/frameworks/__init__.py", line 30, in _inner_patch
return original_fn(*args, **kwargs)
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/serialization.py", line 1172, in _load
result = unpickler.load()
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/serialization.py", line 1142, in persistent_load
typed_storage = load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/serialization.py", line 1112, in load_tensor
storage = zip_file.get_storage_from_record(name, numel, torch.UntypedStorage)._typed_storage()._untyped_storage
OSError: [Errno 14] Bad address
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/transformers/modeling_utils.py", line 464, in load_state_dict
if f.read(7) == "version":
File "/sw/installed/Python/3.10.4-GCCcore-11.3.0/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/beegfs/.global0/ws/s7949670-diplomarbeit/diplomarbeit/DSI-QG/run_arqm_clearml.py", line 391, in <module>
main()
File "/beegfs/.global0/ws/s7949670-diplomarbeit/diplomarbeit/DSI-QG/run_arqm_clearml.py", line 329, in main
model = T5ForConditionalGeneration.from_pretrained(run_args.model_name, cache_dir='cache')
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2903, in from_pretrained
) = cls._load_pretrained_model(
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3246, in _load_pretrained_model
state_dict = load_state_dict(shard_file)
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/transformers/modeling_utils.py", line 476, in load_state_dict
raise OSError(
OSError: Unable to load weights from pytorch checkpoint file for 'cache/models--google--flan-t5-xl/snapshots/8772db7a7a11f7b08e6be7d7088f7a7fd4813bc5/pytorch_model-00001-of-00002.bin' at 'cache/models--google--flan-t5-xl/snapshots/8772db7a7a11f7b08e6be7d7088f7a7fd4813bc5/pytorch_model-00001-of-00002.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12446) of binary: /beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/bin/python
Traceback (most recent call last):
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/beegfs/.global0/ws/s7949670-diplomarbeit/kernels/diplomarbeit-py310-no-tensorflow/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I am using Python 3.10.4, tokenizers==0.13.3, torch==2.0.1+cu118, and transformers==4.31.0.
Apparently, this happened because I did not have enough RAM.
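For anyone who wants to rule out the same cause before loading, here is a rough pre-flight check, offered as a sketch rather than a guaranteed diagnostic. It compares the on-disk size of the checkpoint shards with the free physical memory reported by the OS. The function names, the `headroom` factor of 1.2, and the `n_procs` multiplier are assumptions for illustration; the multiplier is there because a script launched with torchrun loads one copy of the weights per worker process.

```python
import os


def available_ram_bytes() -> int:
    """Free physical memory as reported by the OS (Linux only; not portable)."""
    return os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")


def checkpoint_bytes(paths) -> int:
    """Total on-disk size of the given checkpoint files."""
    return sum(os.path.getsize(p) for p in paths)


def fits_in_ram(ckpt_bytes: int, free_bytes: int, n_procs: int = 1,
                headroom: float = 1.2) -> bool:
    """True if n_procs copies of the checkpoint, plus headroom, fit in free RAM.

    The headroom factor is a guess, not a measured value.
    """
    return ckpt_bytes * n_procs * headroom <= free_bytes
```

If the check fails, options that may help include launching fewer worker processes per node or passing `low_cpu_mem_usage=True` to `from_pretrained`, which as far as I know requires the accelerate package.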
Your error @guicalabria indeed seems linked to a lack of RAM.
@anirudhmittal this can also be due to a connection error. Could you try again and share the stack trace please?
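If a flaky connection is the suspect, retrying the load a few times is a cheap way to test that. The helper below is a generic sketch, not part of transformers; its name and defaults are hypothetical.

```python
import time


def retry(fn, attempts=3, delay_s=1.0, exceptions=(OSError,)):
    """Call fn() up to `attempts` times, sleeping a little longer after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except exceptions as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_s * attempt)  # linear back-off between tries
    raise last_error


# Possible usage (assumes transformers is installed and the hub is reachable):
# from transformers import T5ForConditionalGeneration
# model = retry(lambda: T5ForConditionalGeneration.from_pretrained(
#     "google/flan-t5-xl", force_download=True))
```

`force_download=True` asks `from_pretrained` to fetch the files again instead of reusing a possibly incomplete cached copy.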