Video-Guided Foley Sound Generation with Multimodal Controls Paper • 2411.17698 • Published Nov 26, 2024
3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination Paper • 2406.05132 • Published Jun 7, 2024
Self-Supervised Video Forensics by Audio-Visual Anomaly Detection Paper • 2301.01767 • Published Jan 4, 2023
Binding Touch to Everything: Learning Unified Multimodal Tactile Representations Paper • 2401.18084 • Published Jan 31, 2024
Images that Sound: Composing Images and Sounds on a Single Canvas Paper • 2405.12221 • Published May 20, 2024
PLANNER: Generating Diversified Paragraph via Latent Language Diffusion Model Paper • 2306.02531 • Published Jun 5, 2023
Divide-or-Conquer? Which Part Should You Distill Your LLM? Paper • 2402.15000 • Published Feb 22, 2024
Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation Paper • 2303.11329 • Published Mar 20, 2023
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent Paper • 2309.12311 • Published Sep 21, 2023
Understanding 3D Object Interaction from a Single Image Paper • 2305.09664 • Published May 16, 2023
Phenaki: Variable Length Video Generation From Open Domain Textual Description Paper • 2210.02399 • Published Oct 5, 2022