bob bob

bobrandomnumber

AI & ML interests

None yet

Recent Activity

reacted to merve's post with 👍 22 days ago

ByteDance just dropped SA2VA: a new family of vision LMs combining Qwen2VL/InternVL and SAM2, under the MIT license 💗 https://huggingface.co/collections/ByteDance/sa2va-model-zoo-677e3084d71b5f108d00e093

> The models are capable of tasks involving vision-language understanding and visual referrals (referring segmentation), both for images and videos ⏯️

> The models come in 1B, 4B and 8B sizes; they use InternVL2.5 as the base architecture and Qwen2, Qwen2.5 or InternLM2 as the language model (depending on the checkpoint)

> The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image and video), then concatenates their outputs to feed into the LLM 💬 The output segmentation tokens are passed to SAM2, to sort of match text (captions or semantic classes) to masks ⤵️

> Their annotation pipeline is also interesting: they seem to use two open large vision LMs to refine the annotations, and have different levels of descriptions to provide consistency.

liked a model 10 months ago

roborovski/superprompt-v1

liked a Space 11 months ago

multimodalart/lora-ease

Organizations

None yet

bobrandomnumber's activity

liked a model 10 months ago

roborovski/superprompt-v1

Text2Text Generation • Updated Jul 3, 2024 • 39.6k • 78

liked a Space 11 months ago

🧞 LoRA Ease • Train LoRAs with Ease • Runtime error • 367