LoRA Finetuning - Text vs Vision Effects

#54
by brecker

2 questions:

  1. Are you able to finetune only the textual aspect of Phi3V without finetuning the vision component?
    - Say I want to 'retain' (I assume they will degrade with finetuning) the model's vision capabilities while finetuning it for function calling etc. The multi-modality of the model is important to me.
  2. Which modules are best to target when performing LoRA FT on Phi3V?

  1. Yes, you can.
  • Actually, when using LoRA, the official code shows fine-tuning only the language_model part.
  2. I'm not exactly sure, so I just target all of the layers except for the "lm_head" (see the sketch below).
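
For reference, here's a minimal sketch of that second point using PEFT, assuming the Hugging Face `microsoft/Phi-3-vision-128k-instruct` checkpoint. The name filters (`vision_embed_tokens`, `lm_head`) are assumptions about the module layout, not something from the thread; verify them against `model.named_modules()` before training.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-vision-128k-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Collect Linear layers, skipping the vision tower (which also contains
# img_projection under vision_embed_tokens) and the lm_head. Assumed names;
# check model.named_modules() for your checkpoint.
target_modules = [
    name
    for name, module in model.named_modules()
    if isinstance(module, torch.nn.Linear)
    and "vision_embed_tokens" not in name
    and "lm_head" not in name
]

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=target_modules,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Passing explicit module names keeps the LoRA adapters on the language-model side only, which is the "retain vision, tune text" setup asked about in question 1.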

Thank you, @2U1. Can you point me to what you're referring to here: 'the official code shows fine-tuning only the language_model part'?

@brecker
https://github.com/microsoft/Phi-3CookBook/blob/20d56d79cfd38eb175118ecc961a9b49e2341de2/code/04.Finetuning/vision_finetuning/finetune_hf_trainer_hateful_memes.py#L374-L384

Here's the link for you.
However, the img_projection layer is in the vision_model part, so it gets frozen as well.

If you need to control freezing/unfreezing all three parts independently, you can use my code.
https://github.com/2U1/Phi3-Vision-ft
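
For anyone who just wants the idea without the full repo, here's a minimal sketch (not the code from that repository) of toggling the three blocks by parameter name. The substrings are assumptions based on the Hugging Face Phi-3-Vision layout, where the CLIP tower and the img_projection connector both live under `vision_embed_tokens`; confirm them with `model.named_parameters()`.

```python
def set_trainable(model, train_vision=False, train_projector=True, train_llm=True):
    """Freeze/unfreeze the vision tower, the img_projection connector,
    and the language model independently. Name substrings are assumptions."""
    for name, param in model.named_parameters():
        if "img_projection" in name:            # vision-to-LLM connector
            param.requires_grad = train_projector
        elif "vision_embed_tokens" in name:     # CLIP vision tower and related params
            param.requires_grad = train_vision
        else:                                   # everything else: the language model
            param.requires_grad = train_llm

# Example: keep the vision tower frozen, tune the projector and the LLM.
set_trainable(model, train_vision=False, train_projector=True, train_llm=True)
```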
