We have recently merged Video-LLaVA to @huggingface transformers! 🤗 🎞️ What makes this model different? Keep reading ⇊

[Image: video]

[Demo](https://t.co/MVP14uEj9e) | [Model](https://t.co/oqSCMUqwJo)

See below how to initialize the model and processor and run inference ⬇️

[Image: image_1]

Other models that take image and video input either project the two separately, or downsample the video and project only selected frames. Video-LLaVA instead converts images and videos to a unified representation and projects them using a shared projection layer.

[Image: image_2]

It uses Vicuna 1.5 as the language model and LanguageBind's own encoders, which are based on OpenCLIP. These encoders map both modalities to a unified representation before passing it to the projection layer.

[Image: image_3]

I feel like one of the coolest features of this model is joint understanding of images and videos, which many models have only introduced recently. It's a relatively older model, but it was ahead of its time and works very well!

[Image: image_4]

> [!TIP]
> Resources:
> [Video-LLaVA: Learning United Visual Representation by Alignment Before Projection](https://arxiv.org/abs/2311.10122) by Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan (2023)
> [GitHub](https://github.com/PKU-YuanGroup/Video-LLaVA)
> [Hugging Face documentation](https://huggingface.co/docs/transformers/main/en/model_doc/video_llava)

> [!NOTE]
> [Original tweet](https://x.com/mervenoyann/status/1816427325073842539) (July 25, 2024)
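
For the "initialize the model and processor and infer" step mentioned in the post, here is a minimal sketch of how that might look with transformers. It is not the code from the original screenshot. The checkpoint id `LanguageBind/Video-LLaVA-7B-hf`, the `USER: <video>\n... ASSISTANT:` prompt template, and the choice of 8 uniformly sampled frames are assumptions based on the Hugging Face documentation linked above, so check there before relying on them. The helper names `sample_frame_indices`, `build_prompt`, and `describe_video` are made up for illustration.

```python
# Assumed checkpoint id; verify against the Hugging Face model page.
MODEL_ID = "LanguageBind/Video-LLaVA-7B-hf"


def sample_frame_indices(total_frames, num_frames=8):
    """Return num_frames evenly spaced frame indices covering the clip."""
    if total_frames < 1 or num_frames < 1:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames == 1:
        return [0]
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


def build_prompt(question, placeholder="<video>"):
    """Build a Vicuna-style prompt; the template is an assumption, see lead-in."""
    return f"USER: {placeholder}\n{question} ASSISTANT:"


def describe_video(frames, question, max_new_tokens=60):
    """Run Video-LLaVA on decoded frames.

    frames: array of shape (num_frames, height, width, 3), for example
    the frames picked with sample_frame_indices after decoding a video.
    """
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import (
        VideoLlavaForConditionalGeneration,
        VideoLlavaProcessor,
    )

    processor = VideoLlavaProcessor.from_pretrained(MODEL_ID)
    model = VideoLlavaForConditionalGeneration.from_pretrained(MODEL_ID)

    inputs = processor(
        text=build_prompt(question), videos=frames, return_tensors="pt"
    )
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Calling `describe_video(frames, "Why is this video funny?")` downloads the 7B checkpoint on first use, so it needs a machine with enough memory. For an image question the same flow should apply with an `<image>` placeholder and the processor's `images=` argument instead of `videos=`, which is where the shared projection layer described above comes in.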