---
license: mit
---
## Model Summary
Video-CCAM-14B-v1.1 is a lightweight Video-MLLM developed by the TencentQQ Multimedia Research Team.
## Usage
Inference with Hugging Face `transformers` on NVIDIA GPUs. The requirements below have been tested with Python 3.9/3.10.
```shell
pip install -U pip torch transformers peft decord pysubs2 imageio
```
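The inference snippet below loads the model with `_attn_implementation='flash_attention_2'`, which additionally requires the `flash-attn` package (installable separately, e.g. `pip install flash-attn`). The sketch below is a small, optional pre-check, assuming the custom model code also accepts a standard backend such as `'sdpa'` as a fallback; this fallback behaviour is an assumption, not something verified against the repository.

```python
# Minimal sketch (assumption): pick an attention backend before loading the model.
# flash_attention_2 needs the separate `flash-attn` package; if it is missing,
# fall back to PyTorch SDPA ('sdpa'), assuming the model's custom code accepts it.
import importlib.util

attn_implementation = (
    'flash_attention_2'
    if importlib.util.find_spec('flash_attn') is not None
    else 'sdpa'
)
print(f'attention implementation: {attn_implementation}')
```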
## Inference
```python
import os

import torch
from PIL import Image
from transformers import AutoModel

from eval import load_decord  # helper from the Video-CCAM repository

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# Load the model from a local checkpoint directory.
videoccam = AutoModel.from_pretrained(
    '<your_local_path_1>',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    _attn_implementation='flash_attention_2',
    # llm_name_or_path='<your_local_llm_path>',
    # vision_encoder_name_or_path='<your_local_vision_encoder_path>'
)

# One conversation per sample; <image> and <video> mark where the visual tokens go.
messages = [
    [
        {
            'role': 'user',
            'content': '<image>\nDescribe this image in detail.'
        }
    ], [
        {
            'role': 'user',
            'content': '<video>\nDescribe this video in detail.'
        }
    ]
]

# The i-th entry pairs with the i-th conversation above.
images = [
    Image.open('assets/example_image.jpg').convert('RGB'),
    load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)
]

response = videoccam.chat(messages, images, max_new_tokens=512, do_sample=False)
print(response)
```
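Note that `load_decord` comes from the `eval` module of the Video-CCAM repository, so it is only importable after cloning that repo. If you only need a list of PIL frames to go with the `<video>` placeholder, a minimal uniform-sampling sketch with `decord` looks roughly like the following; the function name and sampling details are assumptions for illustration, not the repository's actual implementation.

```python
# Minimal sketch of uniform frame sampling with decord (an approximation of
# what the repo's `load_decord(..., sample_type='uniform')` helper provides).
import numpy as np
from decord import VideoReader, cpu
from PIL import Image

def sample_frames_uniform(video_path, num_frames=32):
    vr = VideoReader(video_path, ctx=cpu(0))
    # Pick `num_frames` indices evenly spaced across the whole video.
    indices = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    frames = vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3), uint8
    return [Image.fromarray(frame) for frame in frames]
```

The returned list of frames can then stand in for the `load_decord(...)` entry in `images` above.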
Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for more details.
### Benchmarks
|Benchmark|Video-CCAM-14B|Video-CCAM-14B-v1.1|
|:-:|:-:|:-:|
|MVBench (32 frames)|61.88|63.08|
|MSVD-QA (32 frames)|76.3/4.1|78.6/4.2|
|MSRVTT-QA (32 frames)|59.0/3.5|66.3/3.8|
|ActivityNet-QA (32 frames)|58.3/3.7|60.4/3.8|
|TGIF-QA (32 frames)|84.1/4.5|84.4/4.5|
|Video-MME (w/o sub, 96 frames)|53.2|53.9|
|Video-MME (w sub, 96 frames)|57.2|56.1|
|MLVU (M-Avg, 96 frames)|60.2|63.1|
|MLVU (G-Avg, 96 frames)|4.11|4.01|
|VideoVista (96 frames)|68.43|76.55|
* For MSVD-QA, MSRVTT-QA, ActivityNet-QA, and TGIF-QA, results are reported as accuracy/score, both judged by `gpt-3.5-turbo-0125`.
## Acknowledgement
* [xtuner](https://github.com/InternLM/xtuner): Video-CCAM-14B is trained with the xtuner framework. Thanks for their excellent work!
* [Phi-3-medium-4k-instruct](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct): A powerful language model developed by Microsoft.
* [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): An outstanding vision encoder developed by Google.
## License
The model is licensed under the MIT license.