Vision - a henern Collection

Vision

updated 19 days ago

Video, image, GIF, etc.

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
Paper • 2402.17177 • Published Feb 27 • 88

EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions
Paper • 2402.17485 • Published Feb 27 • 185

VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
Paper • 2403.00522 • Published Mar 1 • 44

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
Paper • 2403.04692 • Published Mar 7 • 40

ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Paper • 2311.12793 • Published Nov 21, 2023 • 18

FlashFace: Human Image Personalization with High-fidelity Identity Preservation
Paper • 2403.17008 • Published Mar 25 • 18

An Introduction to Vision-Language Modeling
Paper • 2405.17247 • Published May 27 • 84

Depth Anything V2
Paper • 2406.09414 • Published Jun 13 • 91

Vision language models are blind
Paper • 2407.06581 • Published Jul 9 • 80

SAM 2: Segment Anything in Images and Videos
Paper • 2408.00714 • Published Aug 1 • 103

MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper • 2408.01800 • Published Aug 3 • 74

Imagen 3
Paper • 2408.07009 • Published Aug 13 • 60

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper • 2408.08872 • Published Aug 16 • 96

Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published 28 days ago • 109

CogVLM2: Visual Language Models for Image and Video Understanding
Paper • 2408.16500 • Published 22 days ago • 55