optimum quanto

#3
by JiriCoufal - opened

Hi, very inspiring training you did here.

How much VRAM did optimum quanto help to save? I am amazed you did your initial training on 5x3090...

Did you use the work from that run, or did you start from scratch on the H100s?

Storing the transformer in int8 halves it from 23 GB (bf16) to 11.5 GB. For the 3090 run I used a batch size of 1 per GPU (5 total with DDP) and a 600M-parameter LoKr, which is about 1.2 GB in bf16. With the optim (adamw_bf16) there are 3x states stored on top of the weights, so that's 1.2 * 4 = 4.8 GB for the LoKr and its optimizer states. That's 16.3 GB for model, LoKr, and optim, leaving about 8 GB for grads. At batch size 1 with 1 MP images I would use about 23 GB max.
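For anyone who wants to redo this arithmetic with their own numbers, here is a rough sketch. The function name and the 24 GB card size (the stock 3090 figure) are my own assumptions for illustration, not taken from the training configs. It only counts the static footprint, so real peak usage depends on activations and image size.

```python
# Back-of-the-envelope VRAM estimate; a sketch, not a profiler.
# estimate_static_vram_gb is a hypothetical helper, and card_vram_gb=24.0
# is an assumed value for an RTX 3090.

BYTES_PER_PARAM_BF16 = 2  # bf16 stores each parameter in 2 bytes


def estimate_static_vram_gb(
    transformer_bf16_gb=23.0,    # transformer size in bf16, from the post
    lokr_params_billions=0.6,    # 600M-parameter LoKr
    optimizer_state_copies=3,    # states kept by the optimizer, per the post
    card_vram_gb=24.0,           # assumption: 24 GB card
):
    # int8 uses 1 byte per weight instead of 2, so the base model halves.
    transformer_int8_gb = transformer_bf16_gb / 2
    # Billions of params * bytes per param gives GB directly.
    lokr_gb = lokr_params_billions * BYTES_PER_PARAM_BF16
    # LoKr weights plus each optimizer state copy, all in bf16.
    lokr_and_optim_gb = lokr_gb * (1 + optimizer_state_copies)
    static_gb = transformer_int8_gb + lokr_and_optim_gb
    return {
        "transformer_int8_gb": transformer_int8_gb,
        "lokr_gb": lokr_gb,
        "lokr_and_optim_gb": lokr_and_optim_gb,
        "static_gb": static_gb,
        "headroom_gb": card_vram_gb - static_gb,
    }


if __name__ == "__main__":
    for name, value in estimate_static_vram_gb().items():
        print(f"{name}: {value:.1f} GB")
```

The headroom figure is what remains for gradients and activations at batch size 1.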

The H100 model was continued from the 3090 run by merging the 600m LoKr into the weights and then adding the 3.2 GB LoKr on top. All checkpoints are in here: https://huggingface.co/jimmycarter/flux-training

I updated the README to add the training configs. I don't think there's ever a reason to train FLUX full rank, given that LoKr exists and how effective it is. @KBlueLeaf used it to train KohakuDiffusion and teach SDXL thousands of anime characters, and @bghira full-rank finetuned FLUX in FP8 using torchao, which ended up damaging it, then brought it back to life somehow with a 1.2 B LoKr.

The 3.2 B config file I included here should be enough for any large scale training on FLUX and can do a batch size of 6 on H100 with 1 MP images, leaving room for larger images if you wish to train them too.
