Finetuning for pointing and object detection tasks

#48
by yaneivan - opened

This is a fantastic model! I really appreciate the pointing feature—it's something I’ve only seen in the Molmo model before. However, unlike Molmo, which outputs point coordinates as HTML tags in its text output, this model appears to have dedicated heads for generating points and bounding boxes, which is very impressive.
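For context, the Molmo-style encoding mentioned above puts coordinates directly in the generated text, which the caller then parses. A minimal sketch of that parsing step, assuming the tag and attribute names from Molmo's published examples (e.g. `<point x="61.5" y="40.2" alt="dog">dog</point>`):

```python
import re

# Extract point coordinates from Molmo-style tags embedded in text output.
# The exact tag/attribute names are an assumption based on Molmo's examples.
POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"')

def parse_points(text: str) -> list[tuple[float, float]]:
    """Return (x, y) pairs found in the model's text output."""
    return [(float(x), float(y)) for x, y in POINT_RE.findall(text)]

sample = 'The dog is here: <point x="61.5" y="40.2" alt="dog">dog</point>'
print(parse_points(sample))  # → [(61.5, 40.2)]
```

Dedicated point/box heads avoid this parse step entirely, since coordinates come back as structured output rather than text to be scraped.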

I’m particularly curious about how these new features can be fine-tuned. Do you plan to release a notebook demonstrating the process (for pointing and object detection)? A blog post explaining the model's architecture would also be incredibly helpful for understanding its unique capabilities.

Additionally, I noticed that the example code includes special methods like model.caption and model.query. Does this mean the model cannot be used like a traditional vision-language model? Is it possible to input a chat history for conversational use?
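To illustrate the calling pattern being asked about: each `caption`/`query` call is an independent turn with no chat-history argument. The sketch below uses a stub object so it is self-contained; the method names come from the example code mentioned above, but the return-dict keys (`"caption"`, `"answer"`) and the `length` parameter are assumptions, and a real run would load the checkpoint via `transformers` instead of the stub.

```python
class StubModel:
    """Stand-in with the same method names as the example code discusses.
    Return shapes here are assumptions, not the model's documented API."""

    def caption(self, image, length="normal"):
        return {"caption": f"a {length} caption of {image}"}

    def query(self, image, question):
        return {"answer": f"answer about {image}: {question}"}

def describe(model, image):
    # Each call is one independent single turn; no history is passed.
    cap = model.caption(image, length="short")["caption"]
    ans = model.query(image, "What is shown?")["answer"]
    return cap, ans

cap, ans = describe(StubModel(), "photo.jpg")
print(cap)
```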

Thank you again for this amazing model!

I am working on a post detailing how the pointing/bounding box heads work, as well as scripts for finetuning. Will reply here when that's up.

The model currently cannot be used in a multi-turn conversational setting; we're focused on maximizing single-turn visual understanding, since that's what is most useful for developers building vision applications.
