Multiple crashes when training model WhereIsAI/UAE-Large-V1

#26
by hugguc - opened

Hey, SeanLee97,

Thank you for making your model available!

I'm trying to reproduce your run as described at https://angle.readthedocs.io/en/latest/notes/training.html, following the angle-trainer [recommended] example. I've downloaded both your dataset SeanLee97/all_nli_angle_format_a and the base model google-bert/bert-base-uncased to my hard drive.

I'm using a Linux x86_64 server with an 8-core (16-thread) CPU, 64 GB of RAM, and two RTX 3090 GPUs, running Ubuntu 24.04.

I'm having two issues with the code:

  1. Errors of the kind "Duplicate GPU detected : rank 11 and rank 1 both on CUDA device 2000". I provide a log excerpt below. This problem goes away when I configure the trainer to run with a single process (nproc_per_node=1); see the quick GPU check right after this list.

  2. Errors of the kind "CUDA out of memory.", log excerpt also below. Based on your paper, you seem to be able to run this training code on a 3090. The problem doesn't go away when I configure the code to run on just a single GPU.
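
For reference, a quick check (plain PyTorch and nvidia-smi, nothing specific to angle_emb) confirming that both GPUs are visible from the training environment:

    nvidia-smi -L                                               # lists the two RTX 3090s
    python -c "import torch; print(torch.cuda.device_count())"  # prints 2 on this machine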

Thank you in advance for your advice!

-------- Duplicated GPU rank ----------
[rank11]: File "", line 88, in _run_code
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/angle_emb/angle_trainer.py", line 345, in <module>
[rank11]: main()
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/angle_emb/angle_trainer.py", line 309, in main
[rank11]: model.fit(
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/angle_emb/angle.py", line 1591, in fit
[rank11]: trainer.train()
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/transformers/trainer.py", line 2241, in train
[rank11]: return inner_training_loop(
[rank11]: ^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/transformers/trainer.py", line 2365, in _inner_training_loop
[rank11]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 1389, in prepare
[rank11]: result = tuple(
[rank11]: ^^^^^^
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 1390, in <genexpr>
[rank11]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 1263, in _prepare_one
[rank11]: return self.prepare_model(obj, device_placement=device_placement)
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 1522, in prepare_model
[rank11]: model = torch.nn.parallel.DistributedDataParallel(
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank11]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/torch/distributed/utils.py", line 294, in _verify_param_shape_across_processes
[rank11]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank11]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank11]: Last error:
[rank11]: Duplicate GPU detected : rank 11 and rank 1 both on CUDA device 2000
[rank0]: Traceback (most recent call last):
[rank0]: File "", line 198, in _run_module_as_main
[rank0]: File "", line 88, in _run_code
[rank0]: File "/home/user/Development/venv/lib/python3.12/site-packages/angle_emb/angle_trainer.py", line 345, in <module>
-------- Duplicated GPU rank END -----

------- Out of VRAM ----------
[rank11]: File "", line 88, in _run_code
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/angle_emb/angle_trainer.py", line 345, in <module>
[rank11]: main()
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/angle_emb/angle_trainer.py", line 309, in main
[rank11]: model.fit(
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/angle_emb/angle.py", line 1591, in fit
[rank11]: trainer.train()
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/transformers/trainer.py", line 2241, in train
[rank11]: return inner_training_loop(
[rank11]: ^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/transformers/trainer.py", line 2365, in _inner_training_loop
[rank11]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 1389, in prepare
[rank11]: result = tuple(
[rank11]: ^^^^^^
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 1390, in <genexpr>
[rank11]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 1263, in _prepare_one
[rank11]: return self.prepare_model(obj, device_placement=device_placement)
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/accelerate/accelerator.py", line 1522, in prepare_model
[rank11]: model = torch.nn.parallel.DistributedDataParallel(
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank11]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank11]: File "/home/user/Development/venv/lib/python3.12/site-packages/torch/distributed/utils.py", line 294, in _verify_param_shape_across_processes
[rank11]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank11]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank11]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.21.5
[rank11]: ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
[rank11]: Last error:
[rank11]: Duplicate GPU detected : rank 11 and rank 1 both on CUDA device 2000
[rank0]: Traceback (most recent call last):
[rank0]: File "", line 198, in _run_module_as_main
[rank0]: File "", line 88, in _run_code
[rank0]: File "/home/user/Development/venv/lib/python3.12/site-packages/angle_emb/angle_trainer.py", line 345, in <module>
------- Out of VRAM END -----

WhereIsAI org

Could you provide your training script?

Hi SeanLee97,

hugguc is rate-limited and asked me to respond. Thank you for looking into this!

The training script is below:

=======

#!/bin/sh                                                                                                          

TRAIN_DATA="/home/user/Development/data/all_nli_angle_format_a"
BASE_MODEL_DATA="/home/user/Development/data/bert-base-uncased"

WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=0,1 NCCL_DEBUG=WARN \
                    torchrun --nproc_per_node=16 --master_port=1234 -m angle_emb.angle_trainer \
                    --train_name_or_path "${TRAIN_DATA}" \
                    --save_dir ckpts/bert-base-nli-test \
                    --model_name_or_path "${BASE_MODEL_DATA}" \
                    --pooling_strategy cls \
                    --maxlen 128 \
                    --ibn_w 1.0 \
                    --cln_w 1.0 \
                    --cosine_w 0.0 \
                    --angle_w 0.02 \
                    --angle_tau 20.0 \
                    --learning_rate 5e-5 \
                    --logging_steps 10 \
                    --save_steps 100 \
                    --warmup_steps 50 \
                    --batch_size 128 \
                    --seed 42 \
                    --gradient_accumulation_steps 16 \
                    --epochs 10 \
                    --fp16 1

WhereIsAI org

Could you try setting --nproc_per_node to the number of GPUs you use? Here you should set it to 2, since you have 2 GPUs.
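
For reference, a minimal sketch of the adjusted launch; only --nproc_per_node changes, and the variables and remaining flags are the ones from your script above:

    WANDB_MODE=disabled CUDA_VISIBLE_DEVICES=0,1 NCCL_DEBUG=WARN \
        torchrun --nproc_per_node=2 --master_port=1234 -m angle_emb.angle_trainer \
        --train_name_or_path "${TRAIN_DATA}" \
        --model_name_or_path "${BASE_MODEL_DATA}" \
        --save_dir ckpts/bert-base-nli-test \
        --fp16 1
    # all other hyperparameters (--pooling_strategy, --maxlen, --batch_size, loss weights, ...) as in the original script

With --nproc_per_node=2, torchrun starts one worker per GPU, so each rank gets its own device and the NCCL "Duplicate GPU detected" check no longer fires.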

Thanks again for looking into this so promptly!

Looks like I'm out of rate-limiting and can now reply.

I tried what you proposed, and I see the same (or a similar, I'm not sure) out-of-memory error as before (see the message below).

Apparently the training process saves its state, so it doesn't go through the "pre-CUDA" phase when I rerun it. I'd be happy to rerun it from scratch if you tell me which files I should remove from disk to make it forget that it's already halfway through.
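
My guess, and it is only a guess, is that the state being picked up again is the checkpoints written under --save_dir by the script above; if that's right, starting over would be something like:

    rm -rf ckpts/bert-base-nli-test   # assumption: the reused state is the checkpoints saved via --save_dir

Please correct me if there is other cached state elsewhere.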

Thanks again!

=========== Out of memory error line ===============
[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 384.00 MiB. GPU 0 has a total capacity of 23.57 GiB of which 69.19 MiB is free. Including non-PyTorch memory, this process has 23.42 GiB memory in use. Of the allocated memory 18.74 GiB is allocated by PyTorch, and 4.26 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
============Out of memory error line end============

WhereIsAI org

hi @hugguc , for the OOM issue, you can:

  1. check your GPU status first; make sure the GPUs are free and have enough memory
  2. lower the batch size to reduce GPU memory usage (a sketch of both steps follows).
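
For example, a sketch of both steps; the value 32 is only illustrative (pick the largest per-process batch size that fits), and the allocator setting is the one the OOM message itself suggests:

    # 1. check that the GPUs are idle and how much memory is free
    nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

    # 2. relaunch with a smaller per-process batch size
    PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
        torchrun --nproc_per_node=2 --master_port=1234 -m angle_emb.angle_trainer \
        --train_name_or_path "${TRAIN_DATA}" \
        --model_name_or_path "${BASE_MODEL_DATA}" \
        --batch_size 32 \
        --fp16 1   # remaining flags as in your original script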
WhereIsAI org

btw, it is recommended to use the format B data: SeanLee97/all_nli_angle_format_b

Yay, thanks!

Your suggestion 2 actually worked. I reduced the batch size by half, and now one GPU uses 18 GB and the other 23 GB.

If you want me to provide more insight on the other issue, let me know. As I mentioned earlier, it can be worked around by setting nproc_per_node to 1.

The training time on this dual-3090 setup was about 7 hours. Does that sound reasonable?

Thanks again!
