VideoLLaMA-7B-Charades-VTune Model

Model details

We trained Video-LLaMA with VTune, an instruction-tuning method that we developed specifically to account for consistency in temporal comprehension.

For tuning, we used 5K training videos from Charades-STA with 99K automatically generated annotations.

Evaluation

We evaluated the model on Charades-CON and Charades-STA.

  • Charades-CON

    Metric      Value
    Ground      54.4
    R-Ground    38.2 (70.3)
    S-Ground    10.9 (20.0)
    H-Verify    30.7 (56.6)
    C-Verify    30.0 (55.2)
  • Charades-STA

    Metric         Value
    R@1 IoU=0.3    51.18
    R@1 IoU=0.5    37.15
    R@1 IoU=0.7    20.11
    mIoU           35.29
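
The Charades-STA metrics above are commonly computed from the temporal IoU between one predicted segment and one ground-truth segment per query. The following is a minimal sketch under that assumption, not the authors' evaluation script: the function names `temporal_iou` and `grounding_scores`, the `(start, end)` tuple format in seconds, and the `>=` threshold comparison are all illustrative choices.

```python
from typing import Dict, Sequence, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); assumed input format


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_scores(
    preds: Sequence[Span],
    gts: Sequence[Span],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """R@1 at each IoU threshold and mean IoU, both as percentages."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equal in length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    # R@1: share of queries whose single (top-1) prediction reaches the threshold.
    scores = {
        f"R@1 IoU={t}": 100.0 * sum(iou >= t for iou in ious) / n
        for t in thresholds
    }
    scores["mIoU"] = 100.0 * sum(ious) / n
    return scores


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the model.
    preds = [(0.0, 10.0), (0.0, 10.0), (0.0, 5.0)]
    gts = [(5.0, 15.0), (0.0, 10.0), (10.0, 15.0)]
    print(grounding_scores(preds, gts))
```

Scores are reported as percentages so that they are on the same scale as the table values.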

For more information, see the Paper and Code.

Citation

If you find our research and code useful, please consider starring our repository and citing our paper:

@article{jung2024consistency,
  title={On the Consistency of Video Large Language Models in Temporal Comprehension},
  author={Jung, Minjoon and Xiao, Junbin and Zhang, Byoung-Tak and Yao, Angela},
  journal={arXiv preprint arXiv:2411.12951},
  year={2024}
}