Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models Paper • 2402.17177 • Published Feb 27 • 88
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions Paper • 2402.17485 • Published Feb 27 • 185
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation Paper • 2403.04692 • Published Mar 7 • 40
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions Paper • 2311.12793 • Published Nov 21, 2023 • 18
FlashFace: Human Image Personalization with High-fidelity Identity Preservation Paper • 2403.17008 • Published Mar 25 • 18
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16 • 96
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published 28 days ago • 109
CogVLM2: Visual Language Models for Image and Video Understanding Paper • 2408.16500 • Published 22 days ago • 55