Multi-GPU Training
First, I want to thank you for the awesome work.
I want to fine-tune an anime model based on your AnimagineXL 3.0.
I have 2 GPUs for training and I followed your training config: batch_size = 2x48 on 2 GPUs, meaning each GPU handles batch_size = 48.
But in my experiments, training with batch_size = 2x48 on 2 GPUs is slower than batch_size = 48 with accumulate = 2 on a single GPU.
Do you have any insight into this issue? Thank you.
Can you elaborate further? I still don't get what the issue is, thanks.
I use batch size 48 and grad acc 1.
Hi, I want an effective batch size equal to your stage 1, i.e. effective batch size = 96.
So I have 2 options to achieve this:
- Use 1 GPU: batch_size = 48, accumulate = 2
- Use 2 GPUs: batch_size = 48, accumulate = 1
In my experiments, option 2 is slower than option 1. That is my problem.
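The arithmetic behind the two options can be sketched as follows. This is only an illustration: the helper name `effective_batch_size` is hypothetical and not part of SD-scripts, and it assumes the usual definition (per-GPU batch size times number of GPUs times gradient accumulation steps).

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Samples contributing to one optimizer step (hypothetical helper)."""
    return per_gpu_batch * num_gpus * grad_accum_steps


# Option 1: 1 GPU, batch_size = 48, accumulate = 2
option_1 = effective_batch_size(per_gpu_batch=48, num_gpus=1, grad_accum_steps=2)

# Option 2: 2 GPUs, batch_size = 48, accumulate = 1
option_2 = effective_batch_size(per_gpu_batch=48, num_gpus=2, grad_accum_steps=1)

print(option_1, option_2)
```

Both options feed the same number of samples into each optimizer step. Option 1 runs two forward/backward passes one after the other on a single GPU, while option 2 runs one pass per GPU in parallel and then synchronizes gradients, so a fair speed comparison is probably samples per second (or time per optimizer step) rather than time per iteration.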
That's new. Using grad acc > 1 is supposed to be a bit slower than using batch size alone. There have been a few problems with multi-GPU training in SD-scripts since the distributed data parallel (DDP) process bug was fixed. You may need these args:
ddp_gradient_as_bucket_view = true
ddp_static_graph = true
ddp_timeout = 100000
Thank you, I'm going to try it.
Do you know why my A100 80GB runs out of memory (OOM) with batch_size=16? You can even train with batch size 48.
My buckets:
number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)
bucket 0: resolution (512, 1856), count: 1
bucket 1: resolution (576, 1728), count: 1
bucket 2: resolution (576, 1792), count: 3
bucket 3: resolution (640, 1536), count: 2
bucket 4: resolution (640, 1600), count: 2
bucket 5: resolution (704, 1408), count: 11
bucket 6: resolution (704, 1472), count: 1
bucket 7: resolution (768, 1280), count: 36
bucket 8: resolution (768, 1344), count: 21
bucket 9: resolution (832, 1216), count: 252
bucket 10: resolution (896, 1152), count: 103
bucket 11: resolution (960, 1088), count: 41
bucket 12: resolution (1024, 1024), count: 39
bucket 13: resolution (1088, 960), count: 24
bucket 14: resolution (1152, 896), count: 70
bucket 15: resolution (1216, 832), count: 54
bucket 16: resolution (1280, 768), count: 11
bucket 17: resolution (1344, 768), count: 49
bucket 18: resolution (1408, 704), count: 3
bucket 19: resolution (1472, 704), count: 1
bucket 20: resolution (1536, 640), count: 2
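As a rough check on the bucket list above, the sketch below parses the log lines and computes the pixel area of each bucket, to see whether any single bucket is much larger than the others. It uses only the numbers from the log; the variable names are illustrative and not part of SD-scripts.

```python
import re

# Bucket lines copied from the log above.
log = """\
bucket 0: resolution (512, 1856), count: 1
bucket 1: resolution (576, 1728), count: 1
bucket 2: resolution (576, 1792), count: 3
bucket 3: resolution (640, 1536), count: 2
bucket 4: resolution (640, 1600), count: 2
bucket 5: resolution (704, 1408), count: 11
bucket 6: resolution (704, 1472), count: 1
bucket 7: resolution (768, 1280), count: 36
bucket 8: resolution (768, 1344), count: 21
bucket 9: resolution (832, 1216), count: 252
bucket 10: resolution (896, 1152), count: 103
bucket 11: resolution (960, 1088), count: 41
bucket 12: resolution (1024, 1024), count: 39
bucket 13: resolution (1088, 960), count: 24
bucket 14: resolution (1152, 896), count: 70
bucket 15: resolution (1216, 832), count: 54
bucket 16: resolution (1280, 768), count: 11
bucket 17: resolution (1344, 768), count: 49
bucket 18: resolution (1408, 704), count: 3
bucket 19: resolution (1472, 704), count: 1
bucket 20: resolution (1536, 640), count: 2
"""

pattern = re.compile(r"bucket (\d+): resolution \((\d+), (\d+)\), count: (\d+)")

# Each entry is (bucket_index, first_dim, second_dim, image_count).
buckets = [tuple(map(int, m.groups())) for m in pattern.finditer(log)]

total_images = sum(count for _, _, _, count in buckets)
areas = [a * b for _, a, b, _ in buckets]  # pixels per image in each bucket
smallest_area = min(areas)
largest_area = max(areas)

print(f"buckets: {len(buckets)}, images: {total_images}")
print(f"pixels per image: {smallest_area} to {largest_area}")
```

Every bucket comes out at roughly one megapixel per image, so per-sample memory should be similar across buckets. That would point toward training settings rather than one outlier bucket as the cause of the OOM, though this is only a guess from the log.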