Multi GPUs Training

#16
by toilaluan - opened

First, I want to thank you for the awesome work.
I want to fine-tune an anime model based on your AnimagineXL 3.0.
I have 2 GPUs for training and I followed your training config: batch_size = 2x48 on 2 GPUs, meaning each GPU runs batch_size = 48.
But in my experiments, training with batch_size = 2x48 on 2 GPUs is slower than batch_size = 48 with accumulate = 2 on a single GPU.
Do you have any insight into this issue? Thank you.

Cagliostro Research Lab org

Can you elaborate further? I still don't get what the issue is, thanks.
I use batch size 48 and grad acc 1.

Hi, I want an effective batch size equal to your stage 1, i.e. effective batch size = 96.
So I have 2 options to achieve this:

  1. Use 1 GPU: batch_size = 48, accumulate = 2
  2. Use 2 GPUs: batch_size = 48, accumulate = 1

In my experiments, option 2 is slower than option 1; that is my problem.
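
For reference, here is a minimal sketch of the arithmetic behind the two options, assuming effective batch size is per-GPU batch size x number of GPUs x gradient accumulation steps. The helper name `effective_batch_size` is only illustrative and not part of any training script.

```python
def effective_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum_steps: int) -> int:
    """Number of samples contributing to one optimizer step (hypothetical helper)."""
    return per_gpu_batch * num_gpus * grad_accum_steps


# Option 1: one GPU, two accumulation steps per optimizer step.
option_1 = effective_batch_size(per_gpu_batch=48, num_gpus=1, grad_accum_steps=2)

# Option 2: two GPUs, one accumulation step per optimizer step.
option_2 = effective_batch_size(per_gpu_batch=48, num_gpus=2, grad_accum_steps=1)

print(option_1, option_2)
```

Both options give the same effective batch size, so they are comparable per optimizer step. The difference is that option 1 runs two forward/backward passes back to back on one GPU, while option 2 runs one pass per GPU plus a gradient sync between them.
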
Cagliostro Research Lab org

That's new. Using grad acc > 1 is supposed to be a bit slower than using batch size alone. There have been a few problems with multi-GPU training in SD-scripts since the distributed data parallel process bug was fixed. You may need these args.

ddp_gradient_as_bucket_view = true
ddp_static_graph = true
ddp_timeout = 100000

Thank you, I'm going to try it.
Do you know why my A100 80GB goes OOM with batch_size=16? You even work with batch size 48.
My buckets:

number of images (including repeats) / 各bucketの画像枚数(繰り返し回数を含む)                                                        
bucket 0: resolution (512, 1856), count: 1                                                                                             
bucket 1: resolution (576, 1728), count: 1                                                                                             
bucket 2: resolution (576, 1792), count: 3                                                                                             
bucket 3: resolution (640, 1536), count: 2                                                                                             
bucket 4: resolution (640, 1600), count: 2                                                                                             
bucket 5: resolution (704, 1408), count: 11                                                                                            
bucket 6: resolution (704, 1472), count: 1                                                                                             
bucket 7: resolution (768, 1280), count: 36                                                                                            
bucket 8: resolution (768, 1344), count: 21                                                                                            
bucket 9: resolution (832, 1216), count: 252                                                                                           
bucket 10: resolution (896, 1152), count: 103                                                                                          
bucket 11: resolution (960, 1088), count: 41                                                                                           
bucket 12: resolution (1024, 1024), count: 39                                                                                          
bucket 13: resolution (1088, 960), count: 24                                                                                           
bucket 14: resolution (1152, 896), count: 70                                                                                           
bucket 15: resolution (1216, 832), count: 54                                                                                           
bucket 16: resolution (1280, 768), count: 11                                                                                           
bucket 17: resolution (1344, 768), count: 49                                                                                           
bucket 18: resolution (1408, 704), count: 3                                                                                            
bucket 19: resolution (1472, 704), count: 1                                                                                            
bucket 20: resolution (1536, 640), count: 2
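
One quick check on a bucket list like this is whether any single bucket is much larger than the rest. Below is a rough sketch that parses a few of the log lines and compares each bucket's pixel count against 1024x1024. The regex and variable names are assumptions for illustration, not part of SD-scripts.

```python
import re

# A few lines copied from the bucket log above; paste the full log to check every bucket.
log = """\
bucket 0: resolution (512, 1856), count: 1
bucket 9: resolution (832, 1216), count: 252
bucket 12: resolution (1024, 1024), count: 39
bucket 20: resolution (1536, 640), count: 2
"""

# Captures: bucket index, the two resolution values, and the image count.
pattern = re.compile(r"bucket (\d+): resolution \((\d+), (\d+)\), count: (\d+)")
buckets = [tuple(map(int, m.groups())) for m in pattern.finditer(log)]

reference = 1024 * 1024  # pixel count of the square bucket
for idx, dim0, dim1, count in buckets:
    pixels = dim0 * dim1
    print(f"bucket {idx}: {dim0}x{dim1} = {pixels} px "
          f"({pixels / reference:.0%} of 1024x1024), count {count}")

max_pixels = max(dim0 * dim1 for _, dim0, dim1, _ in buckets)
```

For the lines sampled here, no bucket exceeds 1024x1024 pixels, so the per-image size looks about the same across buckets.
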
Linaqruf changed discussion status to closed
