VideoLLaMA-7B-Charades-VTune Model

Model details

We trained Video-LLaMA with VTune, an instruction-tuning method designed specifically to account for consistency in temporal comprehension.

For tuning, we used 5K training videos from Charades-STA with 99K automatically generated annotations.

Evaluation

We evaluated the model on Charades-CON and Charades-STA.

Charades-CON

Metric	Value
Ground	54.4
R-Ground	38.2 (70.3)
S-Ground	10.9 (20.0)
H-Verify	30.7 (56.6)
C-Verify	30.0 (55.2)

Charades-STA

Metric	Value
R@1 IoU=0.3	51.18
R@1 IoU=0.5	37.15
R@1 IoU=0.7	20.11
mIoU	35.29
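
For readers unfamiliar with the Charades-STA metrics, the following is a minimal sketch of how R@1 at an IoU threshold and mIoU are commonly computed for temporal grounding. It is not our evaluation script: the function names `temporal_iou` and `grounding_scores` are hypothetical, each query is assumed to have exactly one predicted span and one ground-truth span in seconds, and a prediction is assumed to count as a hit when its IoU is greater than or equal to the threshold.

```python
from typing import Dict, Sequence, Tuple

# A time span in seconds: (start, end). Assumed to satisfy start <= end.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_scores(
    preds: Sequence[Span],
    gts: Sequence[Span],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Return R@1 at each IoU threshold and mIoU, all as percentages."""
    if len(preds) != len(gts) or len(preds) == 0:
        raise ValueError("preds and gts must be non-empty and of equal length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    # R@1: share of queries whose top-1 prediction reaches the threshold
    # (assumption: ">=" comparison).
    scores = {
        f"R@1 IoU={t}": 100.0 * sum(iou >= t for iou in ious) / n
        for t in thresholds
    }
    # mIoU: mean IoU over all queries.
    scores["mIoU"] = 100.0 * sum(ious) / n
    return scores


if __name__ == "__main__":
    # Toy data, not taken from Charades-STA.
    predictions = [(0.0, 10.0), (2.0, 8.0), (0.0, 5.0)]
    ground_truth = [(5.0, 15.0), (2.0, 8.0), (10.0, 20.0)]
    print(grounding_scores(predictions, ground_truth))
```

Under these assumptions, a higher threshold can only keep or lower R@1, which matches the ordering of the reported values across IoU=0.3, 0.5, and 0.7.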

For more information, see the paper (arXiv:2411.12951) and the accompanying code repository.

Citation

If you find our research and code useful, please consider starring our repository and citing our paper:

@article{jung2024consistency,
  title={On the Consistency of Video Large Language Models in Temporal Comprehension},
  author={Jung, Minjoon and Xiao, Junbin and Zhang, Byoung-Tak and Yao, Angela},
  journal={arXiv preprint arXiv:2411.12951},
  year={2024}
}
