Learning Vision from Models Rivals Learning Vision from Data Paper • 2312.17742 • Published Dec 28, 2023 • 16
Gemini in Reasoning: Unveiling Commonsense in Multimodal Large Language Models Paper • 2312.17661 • Published Dec 29, 2023 • 14
Boosting Large Language Model for Speech Synthesis: An Empirical Study Paper • 2401.00246 • Published Dec 30, 2023 • 12
A Comprehensive Study of Knowledge Editing for Large Language Models Paper • 2401.01286 • Published Jan 2, 2024 • 17
DocLLM: A layout-aware generative language model for multimodal document understanding Paper • 2401.00908 • Published Dec 31, 2023 • 182
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers Paper • 2401.01974 • Published Jan 3, 2024 • 6
LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model Paper • 2401.02330 • Published Jan 4, 2024 • 15
DiarizationLM: Speaker Diarization Post-Processing with Large Language Models Paper • 2401.03506 • Published Jan 7, 2024 • 13
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation Paper • 2401.04092 • Published Jan 8, 2024 • 21
ANIM-400K: A Large-Scale Dataset for Automated End-To-End Dubbing of Video Paper • 2401.05314 • Published Jan 10, 2024 • 11
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models Paper • 2401.05252 • Published Jan 10, 2024 • 48
Distilling Vision-Language Models on Millions of Videos Paper • 2401.06129 • Published Jan 11, 2024 • 16
Improving fine-grained understanding in image-text pre-training Paper • 2401.09865 • Published Jan 18, 2024 • 17
Rethinking FID: Towards a Better Evaluation Metric for Image Generation Paper • 2401.09603 • Published Nov 30, 2023 • 17
Understanding Video Transformers via Universal Concept Discovery Paper • 2401.10831 • Published Jan 19, 2024 • 8
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads Paper • 2401.10774 • Published Jan 19, 2024 • 54