Some issues regarding training
#9
opened by zwq2018
Hello, I am doing continued pre-training on the base model (idefics2-8b-base), but I have some questions.
When I used Llava1.0 image-text pair (595k pertaining data) and idefics2-8b-base for pre-training, I found that the initial loss was very high, approximately around 6 - 7. This seems abnormal? May I ask what the loss is approximately like when you complete the pertaining for 8b based model?
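(For scale: with a vocabulary of roughly 32k tokens, a uniformly random predictor would already sit at a cross-entropy of ln(32000) ≈ 10.4, so 6-7 is below chance level but still far above what I would expect from a pretrained base model.)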
I also found that the ignore_index handling seems to differ across transformers versions. For example, in transformers==4.40 you set image_token_id (32001) as the ignore index, while in version 4.42 you use -100. I use the following code to set my labels:
```python
labels[labels == self.processor.tokenizer.pad_token_id] = -100
labels[labels == image_token_id] = -100
```
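For context, I resolve the special-token ids from the tokenizer rather than hardcoding 32000/32001, so the masking does not depend on a particular transformers version. A minimal sketch of my setup (the checkpoint name is the base model I'm loading):

```python
from transformers import AutoProcessor

# Load the processor for the base checkpoint being trained from.
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
tokenizer = processor.tokenizer

# Look the ids up instead of hardcoding them; "<image>" and
# "<fake_token_around_image>" are idefics2's special tokens.
image_token_id = tokenizer.convert_tokens_to_ids("<image>")
fake_image_token_id = tokenizer.convert_tokens_to_ids("<fake_token_around_image>")
```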
- When setting the ignore labels, should fake_token_around_image (32000) also be ignored, or should its loss be calculated? That is, do I need the following line:
```python
labels[labels == fake_image_token_id] = -100
```
- Loss values can depend on many factors; you should instead track performance on different tasks to see whether there is a bug.
- Yes, we ignore the loss calculation on both the pad tokens and the image tokens.
- We didn't mask the loss on these tokens, but yes, you can do it.
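Putting these answers together, a label-building step consistent with them might look like the sketch below. It reuses the token ids resolved in the question above; build_labels is an illustrative name, not the actual training code:

```python
import torch

def build_labels(input_ids: torch.Tensor) -> torch.Tensor:
    labels = input_ids.clone()
    labels[labels == tokenizer.pad_token_id] = -100  # no loss on padding
    labels[labels == image_token_id] = -100          # no loss on image placeholder tokens
    # Optional: the released training run did not mask these,
    # but masking them is also fine.
    # labels[labels == fake_image_token_id] = -100
    return labels
```

Applied per batch, e.g. `labels = build_labels(batch["input_ids"])` inside the collator.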