Explores o1-like multimodal reasoning.
The multi-agent setup with DPO is a nice touch.
Paper: https://arxiv.org/pdf/2411.14432
Code: https://github.com/dongyh20/Insight-V
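Since the multi-agent pipeline is trained with DPO, here is a minimal sketch of the plain DPO objective for reference (my own simplified version, not Insight-V's training code; all tensor names are placeholders):

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of the policy against a frozen reference model
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    # Push the preferred response's log-ratio above the rejected one's
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()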
import time

from askui import VisionAgent

with VisionAgent() as agent:
    # Open the default browser on the search page
    agent.tools.webbrowser.open_new("http://www.google.com")
    time.sleep(0.5)
    # model_name selects the vision model used to locate the element on screen
    agent.click("search field in the center of the screen", model_name="Qwen/Qwen2-VL-7B-Instruct")
    agent.type("cats")
    agent.keyboard("enter")
    time.sleep(0.5)
    agent.click("text 'Images'", model_name="AskUI/PTA-1")
    time.sleep(0.5)
    agent.click("second cat image", model_name="OS-Copilot/OS-Atlas-Base-7B")
Hahaha, at least someone got it.
Agents and function-calling tools are something I explored recently, and they seem promising. I am still exploring the possibilities.
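In case it helps, here is the basic pattern I mean, as a rough library-free sketch (the tool-call dict and the get_weather tool are made up for illustration; real frameworks just formalize this loop):

import json

# Hypothetical tool exposed to the model
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub; a real tool would call a weather API

TOOLS = {"get_weather": get_weather}

# Pretend the model responded with this structured tool call
model_output = {"name": "get_weather", "arguments": json.dumps({"city": "Paris"})}

# Dispatch the call and hand the result back to the model on the next turn
tool = TOOLS[model_output["name"]]
result = tool(**json.loads(model_output["arguments"]))
print(result)  # -> "Sunny in Paris"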
Hey John, currently the open-source models are not that good at coding, and even GPT struggles with it, but Claude 3.5 Sonnet is the best, with limited code errors. Maybe a model trained specifically on code would be able to handle such a task. But the idea is really good; I also found a lot of good Spaces in the above link, thank you so much.
Hey, thank you so much John, that was really insightful. I will surely read the post above.
Hi John, thanks so much for the contribution. However, I would like to implement some upgrades to my RAG setup for a PDF summarization task. So far I have not worked a lot on the vector DB creation, chunking, indexing, and embedding parts, and I feel that improving these components would sharpen retrieval, especially for 100-200 page research documents. If possible, can you provide some suggestions on that part? I have sketched my rough starting point below. Thanks
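For context, this is roughly the baseline I am starting from, just a sketch with assumed choices (the chunk size, overlap, the all-MiniLM-L6-v2 model, and paper.txt are placeholders, not recommendations):

import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    # Sliding window with overlap, so sentences cut at a chunk boundary
    # still appear whole in the neighbouring chunk
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

document = open("paper.txt").read()  # stands in for the extracted PDF text
chunks = chunk_text(document)
embeddings = model.encode(chunks, normalize_embeddings=True)

# Brute-force cosine retrieval; a vector DB (FAISS, Chroma, etc.) replaces this at scale
query = model.encode(["What is the main contribution?"], normalize_embeddings=True)[0]
scores = embeddings @ query
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:5]]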