Explores o1-like multimodal reasoning.
The multi-agent setup with DPO is a nice touch.
Paper: https://arxiv.org/pdf/2411.14432
Code: https://github.com/dongyh20/Insight-V
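Since the multi-agent pipeline is trained with DPO, here is a minimal sketch of the plain DPO objective for reference (my own simplified version, not Insight-V's training code; all tensor names are placeholders):

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen: torch.Tensor,
             policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of the policy against a frozen reference model
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    # Push the preferred response's log-ratio above the rejected one's
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()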
import time

from askui import VisionAgent

with VisionAgent() as agent:
    # Open the default browser on the search page
    agent.tools.webbrowser.open_new("http://www.google.com")
    time.sleep(0.5)
    # model_name selects the vision model used to locate the element on screen
    agent.click("search field in the center of the screen", model_name="Qwen/Qwen2-VL-7B-Instruct")
    agent.type("cats")
    agent.keyboard("enter")
    time.sleep(0.5)
    agent.click("text 'Images'", model_name="AskUI/PTA-1")
    time.sleep(0.5)
    agent.click("second cat image", model_name="OS-Copilot/OS-Atlas-Base-7B")
Hahaha, at least someone got it.
Agents and function-calling tools are something I explored recently, and they seem promising. I am still exploring the possibilities.
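In case it helps, here is the basic pattern I mean, as a rough library-free sketch (the tool-call dict and the get_weather tool are made up for illustration; real frameworks just formalize this loop):

import json

# Hypothetical tool exposed to the model
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub; a real tool would call a weather API

TOOLS = {"get_weather": get_weather}

# Pretend the model responded with this structured tool call
model_output = {"name": "get_weather", "arguments": json.dumps({"city": "Paris"})}

# Dispatch the call and hand the result back to the model on the next turn
tool = TOOLS[model_output["name"]]
result = tool(**json.loads(model_output["arguments"]))
print(result)  # -> "Sunny in Paris"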
Hey John, currently the open-source models are not that good at coding, and even GPT struggles with it, but Claude 3.5 Sonnet is the best, with limited code errors. Maybe a model trained specifically on code would be able to handle such a task. But the idea is really good; I also found a lot of good Spaces in the above link, thank you so much.
Hey, thank you so much John, that was really insightful. I will surely read the post above.
Hi John, thanks so much for the contribution. However, I would like to implement some upgrades to my RAG setup for a PDF summarization task. So far I have not worked a lot on the vector DB creation, chunking, indexing, and embedding parts, and I feel that improving these components would sharpen retrieval, especially for 100-200 page research documents. If possible, can you provide some suggestions on that part? I have sketched my rough starting point below. Thanks
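For context, this is roughly the baseline I am starting from, just a sketch with assumed choices (the chunk size, overlap, the all-MiniLM-L6-v2 model, and paper.txt are placeholders, not recommendations):

import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, size: int = 800, overlap: int = 200) -> list[str]:
    # Sliding window with overlap, so sentences cut at a chunk boundary
    # still appear whole in the neighbouring chunk
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

document = open("paper.txt").read()  # stands in for the extracted PDF text
chunks = chunk_text(document)
embeddings = model.encode(chunks, normalize_embeddings=True)

# Brute-force cosine retrieval; a vector DB (FAISS, Chroma, etc.) replaces this at scale
query = model.encode(["What is the main contribution?"], normalize_embeddings=True)[0]
scores = embeddings @ query
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:5]]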