JaronTHU/Video-CCAM-9B · Hugging Face

Model Summary

Video-CCAM-9B is a Video-MLLM built on Yi-1.5-9B-Chat and SigLIP SO400M.

Inference uses Hugging Face transformers on NVIDIA GPUs. The requirements below were tested on Python 3.10:

torch==2.1.0
torchvision==0.16.0
transformers==4.40.2
peft==0.10.0
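
Before loading the model, it can help to confirm that the environment matches these pins. The following is a minimal sketch, not part of the Video-CCAM code base: the helper name `find_mismatches` and the `PINNED` dictionary are hypothetical, and the only library call used is the standard-library `importlib.metadata.version`. It assumes that a local build tag such as `+cu121` on the installed version should be ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the requirements list above.
PINNED = {
    "torch": "2.1.0",
    "torchvision": "0.16.0",
    "transformers": "4.40.2",
    "peft": "0.10.0",
}


def find_mismatches(pinned, get_version=version):
    """Return {package: (expected, installed_or_None)} for every unmet pin.

    `get_version` is injectable so the check can be exercised without
    the real packages being installed (hypothetical helper, for illustration).
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # the package is not installed at all
        # Assumption: ignore a local build tag such as "+cu121" when comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

Running the script prints one line per package whose installed version differs from the tested one, and prints nothing when the environment matches.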

Please refer to Video-CCAM for inference and evaluation.

#Frames	32	96
w/o subs	50.0	50.6
w/ subs	53.1	54.9

xtuner: Video-CCAM-9B is trained using the xtuner framework. Thanks for their excellent work!
Yi-1.5-9B-Chat: Great language models developed by 01.AI.
SigLIP SO400M: Outstanding vision encoder developed by Google.

The model is licensed under the MIT license.