Got torch.distributed.elastic.multiprocessing.errors.ChildFailedError

#9
by szhao41 - opened

I tried the fine-tuning script (`sh finetune.sh`), and it shows: "Your setup doesn't support bf16/gpu. You need torch>=1.10, using Ampere GPU with cuda>=11.0". If I change bf16 to fp16, I get another error: torch.distributed.elastic.multiprocessing.errors.ChildFailedError. Are the two related, and should I still use bf16? My GPU is Tesla-V100-GENERAL-32GB, so it may not support bf16.

No, bf16 is not supported on Volta and older GPU generations. You need to set the bf16 argument to False wherever it appears in the training arguments.
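For example, if the script builds Hugging Face `TrainingArguments`, the change might look like the sketch below; the `output_dir` and batch size are placeholders, not the script's own values. On a V100 you can keep fp16 mixed precision instead:

```python
from transformers import TrainingArguments

# Minimal sketch assuming the script uses Hugging Face TrainingArguments;
# "./output" and the batch size are placeholder values.
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    bf16=False,  # V100 (Volta) has no bf16 hardware support
    fp16=True,   # fp16 mixed precision is supported on V100
)
```

As for the ChildFailedError you saw with fp16: that exception is just the elastic launcher's wrapper around a worker process dying, so the actual cause is in the traceback printed above it in the log.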
