Training version 896

#9
by lcolonn - opened

Has anyone managed to train this model yet? Due to the large image token sequence length, it requires a sharding strategy. I set up all my code with Lightning rather than HF and Accelerate (which was probably a mistake) and am still unable to get it to train; I keep running into errors with FSDP. Wondering if anyone has managed to train the model.
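
For context, this is roughly the shape of what I'm attempting (a simplified sketch only; `VLMFineTuneModule` and `make_dataloader` are placeholders for my actual LightningModule and data pipeline):

```python
import lightning.pytorch as pl
from lightning.pytorch.strategies import FSDPStrategy

# Placeholders for my own wrapper module and data code -- not real imports.
from my_finetune import VLMFineTuneModule, make_dataloader

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    strategy=FSDPStrategy(),  # defaults to full sharding; this is where the errors show up
    precision="32-true",
    max_epochs=1,
)

trainer.fit(VLMFineTuneModule(), train_dataloaders=make_dataloader())
```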

If this is of any help to others: using Lightning and FSDP with 8× A100 80GB, I am now able to train it with a batch size of 2 in fp32. I'll try to release my fine-tuning code soon. The key here is activation checkpointing, as the activations take up the vast majority of VRAM due to the large sequence length.
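
Concretely, the change that made the difference was along these lines (a sketch against recent Lightning versions; `TransformerBlock` is a stand-in for whatever decoder layer class the model actually uses):

```python
from lightning.pytorch.strategies import FSDPStrategy
from torch.distributed.fsdp import ShardingStrategy

# Placeholder for the model's actual transformer layer class.
from my_model import TransformerBlock

strategy = FSDPStrategy(
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    # Shard at the level of individual transformer blocks.
    auto_wrap_policy={TransformerBlock},
    # Recompute each block's activations in the backward pass instead of
    # storing them -- with the long image-token sequence it's the activations,
    # not the parameters, that dominate VRAM.
    activation_checkpointing_policy={TransformerBlock},
)
```

Passing that strategy to the Trainer is what got me to a batch size of 2 in fp32 on the A100s.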

@lcolonn did you ever release that fine-tuning code?

Hi @kwin-sustainment , I'm quite busy finishing a project at the moment and was planning a release in around two weeks' time. However, if it's urgent for you, I can send over the main elements.

@lcolonn hey, no worries! I actually got it working after digging into your hint about activation checkpointing and VRAM. Thank you for the advice :)

Edit: for people looking at this thread in the future, I think you need a MINIMUM of a 24GB card to train this model. It would be a batch size of one and take forever, but it'd work...
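
For reference, a minimal single-GPU sketch of what that could look like with plain HF Transformers (the repo id and `train_dataset` are placeholders, and bf16 plus gradient checkpointing are assumptions to squeeze under 24GB):

```python
import torch
from transformers import AutoModelForVision2Seq, Trainer, TrainingArguments

# Placeholders -- substitute the actual checkpoint and your own processed dataset.
model_id = "org/vision-language-model-896"
from my_data import train_dataset

model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,   # all a 24GB card realistically fits
    gradient_accumulation_steps=16,  # recover a usable effective batch size
    gradient_checkpointing=True,     # recompute activations to save VRAM
    bf16=True,
    logging_steps=10,
)

Trainer(model=model, args=args, train_dataset=train_dataset).train()
```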
