An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels • arXiv:2406.09415 • Published Jun 13, 2024
4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities • arXiv:2406.09406 • Published Jun 13, 2024
VideoGUI: A Benchmark for GUI Automation from Instructional Videos • arXiv:2406.10227 • Published Jun 14, 2024
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model • arXiv:2407.16198 • Published Jul 23, 2024
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding • arXiv:2407.15754 • Published Jul 22, 2024
Theia: Distilling Diverse Vision Foundation Models for Robot Learning • arXiv:2407.20179 • Published Jul 29, 2024
SHIC: Shape-Image Correspondences with no Keypoint Supervision • arXiv:2407.18907 • Published Jul 26, 2024
Improving 2D Feature Representations by 3D-Aware Fine-Tuning • arXiv:2407.20229 • Published Jul 29, 2024
HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments • arXiv:2408.10945 • Published Aug 20, 2024
Agent-to-Sim: Learning Interactive Behavior Models from Casual Longitudinal Videos • arXiv:2410.16259 • Published Oct 21, 2024