Some issues regarding training
#9
opened by zwq2018
Hello, I am doing continued pre-training on the base model (idefics2-8b-base), but I have some questions.
When I used Llava1.0 image-text pair (595k pertaining data) and idefics2-8b-base for pre-training, I found that the initial loss was very high, approximately around 6 - 7. This seems abnormal? May I ask what the loss is approximately like when you complete the pertaining for 8b based model?
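(For scale: with a vocabulary of roughly 32k tokens, a uniformly random predictor would already sit at a cross-entropy of ln(32000) ≈ 10.4, so 6-7 is below chance level but still far above what I would expect from a pretrained base model.)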
I also found that the ignore_index handling seems to differ across transformers versions. For example, in transformers==4.40 you set image_token_id (32001) as the ignore index, while in version 4.42 you use -100. I use the following code to set my labels:
```python
labels[labels == self.processor.tokenizer.pad_token_id] = -100
labels[labels == image_token_id] = -100
```
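For context, I resolve the special-token ids from the tokenizer rather than hardcoding 32000/32001, so the masking does not depend on a particular transformers version. A minimal sketch of my setup (the checkpoint name is the base model I'm loading):

```python
from transformers import AutoProcessor

# Load the processor for the base checkpoint being trained from.
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
tokenizer = processor.tokenizer

# Look the ids up instead of hardcoding them; "<image>" and
# "<fake_token_around_image>" are idefics2's special tokens.
image_token_id = tokenizer.convert_tokens_to_ids("<image>")
fake_image_token_id = tokenizer.convert_tokens_to_ids("<fake_token_around_image>")
```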
- When setting the ignore labels, should fake_token_around_image (32000) also be ignored, or should its loss be calculated? That is, do I need the following line:
```python
labels[labels == fake_image_token_id] = -100
```
- Loss values can depend on many factors; you should instead track performance on different tasks to see whether there is a bug.
- Yes, we ignore the loss calculation on both the pad tokens and the image tokens.
- We didn't mask the loss on these tokens, but yes, you can do it.
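Putting these answers together, a label-building step consistent with them might look like the sketch below. It reuses the token ids resolved in the question above; build_labels is an illustrative name, not the actual training code:

```python
import torch

def build_labels(input_ids: torch.Tensor) -> torch.Tensor:
    labels = input_ids.clone()
    labels[labels == tokenizer.pad_token_id] = -100  # no loss on padding
    labels[labels == image_token_id] = -100          # no loss on image placeholder tokens
    # Optional: the released training run did not mask these,
    # but masking them is also fine.
    # labels[labels == fake_image_token_id] = -100
    return labels
```

Applied per batch, e.g. `labels = build_labels(batch["input_ids"])` inside the collator.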