Question about chat vector merging

#1
by qutrino - opened

Thank you for releasing such an excellent model.

I noticed that the qwen2.5-bakeneko-32b-instruct model states that "the embedding layer was omitted when performing the subtraction and addition of parameter vectors," but this note is absent in deepseek-r1-distill-qwen2.5-bakeneko-32b.

Could you please clarify which approach was used for this model? If a different method was applied, a brief explanation of the rationale would be greatly appreciated.

Thank you for your help.

rinna Co., Ltd. org

Thanks for your comment! As mentioned in the model card, we merged ALL layers, including the embedding layer, when developing deepseek-r1-distill-qwen2.5-bakeneko-32b.
The DeepSeek-R1 series uses LlamaTokenizer, while the Qwen2.5 series uses QwenTokenizer. We chose to merge every layer so the model could better accommodate this tokenizer transition.
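
For readers unfamiliar with the technique, below is a minimal sketch of chat-vector merging with Hugging Face Transformers. The model IDs and the `skip_keys` set are hypothetical placeholders, not rinna's actual recipe: leaving `skip_keys` empty corresponds to merging all layers, while adding the embedding/output keys would reproduce the "embedding layer omitted" variant described for qwen2.5-bakeneko-32b-instruct.

```python
# Minimal sketch of chat-vector arithmetic (hypothetical model IDs, not rinna's exact recipe).
import torch
from transformers import AutoModelForCausalLM

def load_weights(model_id: str) -> dict[str, torch.Tensor]:
    """Load a model's weights on CPU as a state dict."""
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    return model.state_dict()

# Hypothetical placeholders for the three checkpoints involved.
base = load_weights("base-model")      # original base model
donor = load_weights("donor-model")    # model whose tuning is extracted as the chat vector
target = load_weights("target-model")  # model that receives the chat vector

# Keys to exclude from the merge. Leave empty to merge ALL layers;
# e.g. {"model.embed_tokens.weight", "lm_head.weight"} would skip the
# embedding and output layers instead.
skip_keys: set[str] = set()

merged = {}
for key, weight in target.items():
    if key in skip_keys or key not in base or base[key].shape != weight.shape:
        # Skipped, missing, or shape-mismatched (e.g. different vocabulary): keep as-is.
        merged[key] = weight
    else:
        # Chat vector = donor - base; adding it transfers the donor's tuned behavior.
        merged[key] = weight + (donor[key] - base[key])
```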
