---
license: mit
---
## Model Summary

Video-CCAM-14B-v1.1 is a lightweight Video-MLLM (video multimodal large language model) developed by the TencentQQ Multimedia Research Team.

## Usage

Run inference with Hugging Face `transformers` on NVIDIA GPUs. The requirements were tested with Python 3.9 and 3.10.

```
pip install -U pip torch transformers peft decord pysubs2 imageio
```
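
Note that `flash-attn` is not in the list above, while the inference example below requests `_attn_implementation='flash_attention_2'`; with recent `transformers` releases that setting requires the `flash-attn` package to be installed separately. A minimal environment check (a sketch, assuming you may want to fall back to PyTorch's built-in attention) might look like:

```
import torch

# bfloat16 inference on GPU requires a CUDA device.
assert torch.cuda.is_available(), 'No CUDA device visible to PyTorch'
print('GPU:', torch.cuda.get_device_name(0))

# FlashAttention-2 is optional; fall back to PyTorch's scaled-dot-product
# attention ('sdpa') when the flash-attn package is absent.
try:
    import flash_attn  # noqa: F401
    attn_implementation = 'flash_attention_2'
except ImportError:
    attn_implementation = 'sdpa'
print('Attention implementation:', attn_implementation)
```

If your GPU or driver stack lacks FlashAttention-2 support, pass the resulting `attn_implementation` value to `from_pretrained` instead of the hard-coded string below.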

## Inference

```
import os

import torch
from PIL import Image
from transformers import AutoModel

from eval import load_decord

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

# Load the model with remote code; optionally point to local copies of the
# LLM and vision encoder if they are not bundled with the checkpoint.
videoccam = AutoModel.from_pretrained(
    '<your_local_path_1>',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    _attn_implementation='flash_attention_2',
    # llm_name_or_path='<your_local_llm_path>',
    # vision_encoder_name_or_path='<your_local_vision_encoder_path>'
)

# One conversation per sample: the first queries an image, the second a video.
messages = [
    [
        {
            'role': 'user',
            'content': '<image>\nDescribe this image in detail.'
        }
    ], [
        {
            'role': 'user',
            'content': '<video>\nDescribe this video in detail.'
        }
    ]
]

# Visual inputs, in the same order as `messages`: one RGB image and
# 32 uniformly sampled frames from a video.
images = [
    Image.open('assets/example_image.jpg').convert('RGB'),
    load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)
]

response = videoccam.chat(messages, images, max_new_tokens=512, do_sample=False)

print(response)
```
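
`load_decord` is imported from the `eval` module of the Video-CCAM repository, so the snippet above assumes it runs from a checkout of that repo. If you only need uniform frame sampling, a minimal stand-in built directly on `decord` might look like the sketch below; the function name `load_decord_uniform` and the list-of-PIL-images return type are assumptions chosen to match the call above, not the repository's actual implementation:

```
import numpy as np
from decord import VideoReader, cpu
from PIL import Image


def load_decord_uniform(video_path, num_frames=32):
    """Hypothetical stand-in: uniformly sample num_frames RGB frames."""
    vr = VideoReader(video_path, ctx=cpu(0))
    # Evenly spaced frame indices spanning the whole clip.
    indices = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    frames = vr.get_batch(indices).asnumpy()  # (num_frames, H, W, 3) uint8
    return [Image.fromarray(frame) for frame in frames]
```

With this helper, the `load_decord(...)` call above could be replaced by `load_decord_uniform('assets/example_video.mp4', num_frames=32)`.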

Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for more details.

### Benchmarks

|Benchmark|Video-CCAM-14B|Video-CCAM-14B-v1.1|
|:-:|:-:|:-:|
|MVBench (32 frames)|61.88|63.08|
|MSVD-QA (32 frames)|76.3/4.1|78.6/4.2|
|MSRVTT-QA (32 frames)|59.0/3.5|66.3/3.8|
|ActivityNet-QA (32 frames)|58.3/3.7|60.4/3.8|
|TGIF-QA (32 frames)|84.1/4.5|84.4/4.5|
|Video-MME (w/o sub, 96 frames)|53.2|53.9|
|Video-MME (w sub, 96 frames)|57.2|56.1|
|MLVU (M-Avg, 96 frames)|60.2|63.1|
|MLVU (G-Avg, 96 frames)|4.11|4.01|
|VideoVista (96 frames)|68.43|76.55|

* For MSVD-QA, MSRVTT-QA, ActivityNet-QA, and TGIF-QA, each cell reports accuracy/score, both judged by `gpt-3.5-turbo-0125`.
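
GPT-assisted judging of this kind typically asks the judge model whether a prediction matches the ground-truth answer and for a 0-5 quality score. As a rough sketch only (the prompt wording, parsing, and helper name are assumptions; the exact evaluation scripts live in the Video-CCAM repository):

```
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def judge(question, answer, prediction):
    """Hypothetical judge call following the common GPT-assisted QA protocol."""
    prompt = (
        'Evaluate whether the predicted answer matches the correct answer.\n'
        f'Question: {question}\n'
        f'Correct answer: {answer}\n'
        f'Predicted answer: {prediction}\n'
        "Reply with 'yes' or 'no' and an integer score from 0 to 5."
    )
    reply = client.chat.completions.create(
        model='gpt-3.5-turbo-0125',
        messages=[{'role': 'user', 'content': prompt}],
    )
    return reply.choices[0].message.content
```
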
## Acknowledgement

* [xtuner](https://github.com/InternLM/xtuner): Video-CCAM-14B is trained with the xtuner framework. Thanks for their excellent work!
* [Phi-3-medium-4k-instruct](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct): a powerful language model developed by Microsoft.
* [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): an outstanding vision encoder developed by Google.

## License

The model is licensed under the MIT license.