Yuhao Dong

THUdyh

AI & ML interests

None yet

Recent Activity

updated a model 2 days ago

THUdyh/Ola-7b

updated a model 3 days ago

THUdyh/Ola_speech_encoders

published a model 3 days ago

THUdyh/Ola_speech_encoders

THUdyh's activity

upvoted a paper 19 days ago

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

Paper • 2501.04003 • Published 21 days ago • 24

upvoted a paper 23 days ago

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Paper • 2501.01957 • Published 25 days ago • 42

upvoted a paper 25 days ago

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Paper • 2501.00958 • Published 27 days ago • 98

upvoted 4 papers about 1 month ago

Byte Latent Transformer: Patches Scale Better Than Tokens

Paper • 2412.09871 • Published Dec 13, 2024 • 89

Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models

Paper • 2412.09645 • Published Dec 10, 2024 • 35

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Paper • 2412.10360 • Published Dec 13, 2024 • 139

GenEx: Generating an Explorable World

Paper • 2412.09624 • Published Dec 12, 2024 • 89

upvoted 13 papers about 2 months ago

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Paper • 2412.09501 • Published Dec 12, 2024 • 45

Phi-4 Technical Report

Paper • 2412.08905 • Published Dec 12, 2024 • 106

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Paper • 2412.09596 • Published Dec 12, 2024 • 93

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Paper • 2412.05271 • Published Dec 6, 2024 • 129

Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection

Paper • 2412.04455 • Published Dec 5, 2024 • 37

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Paper • 2412.04467 • Published Dec 5, 2024 • 105

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Paper • 2412.04424 • Published Dec 5, 2024 • 59

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

Paper • 2412.00493 • Published Nov 30, 2024 • 16

PaliGemma 2: A Family of Versatile VLMs for Transfer

Paper • 2412.03555 • Published Dec 4, 2024 • 124