VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Abstract
In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the model's multimodal understanding by incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even approaches some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 shows reasonable improvements over existing models on audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.
Community
You are basically modifying LLaVA, yet the paper gives a totally different impression. Inside the code, it's all LLaVA.
if "videollama" in model_name.lower():
    # Load LLaVA model
I mean, the predecessor of this is clearly LLaVA, and IMHO you are missing some important details in the paper.
Thanks for pointing this out.
Yes, the codebase of VideoLLaMA2 is adapted from LLaVA. We have mentioned this and given credit to LLaVA in several places (e.g., videollama2_arch.py, videollama2_mistral.py, train.py, project page). We will make this clearer in the next version of our technical report.