---
license: mit
---

## Model Summary

Video-CCAM-14B-v1.1 is a lightweight Video-MLLM (video multimodal large language model) developed by the TencentQQ Multimedia Research Team.

## Usage

Inference uses Hugging Face `transformers` on NVIDIA GPUs. The requirements below were tested with Python 3.9 and 3.10.
```
pip install -U pip torch transformers peft decord pysubs2 imageio
```

## Inference

```
import os
import torch
from PIL import Image
from transformers import AutoModel

# load_decord is a helper from the Video-CCAM repository (linked below), not a PyPI package
from eval import load_decord

os.environ['TOKENIZERS_PARALLELISM'] = 'false'

videoccam = AutoModel.from_pretrained(
    '<your_local_path_1>',
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map='auto',
    # FlashAttention-2 requires the separate flash-attn package to be installed
    _attn_implementation='flash_attention_2',
    # llm_name_or_path='<your_local_llm_path>',
    # vision_encoder_name_or_path='<your_local_vision_encoder_path>'
)


messages = [
    [
        {
            'role': 'user',
            'content': '<image>\nDescribe this image in detail.'
        }
    ], [
        {
            'role': 'user',
            'content': '<video>\nDescribe this video in detail.'
        }
    ]
]

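# One visual input per conversation, in the same order as `messages`:
# a single RGB image for the first, 32 uniformly sampled video frames for the second
images = [
    Image.open('assets/example_image.jpg').convert('RGB'),
    load_decord('assets/example_video.mp4', sample_type='uniform', num_frames=32)
]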

response = videoccam.chat(messages, images, max_new_tokens=512, do_sample=False)

print(response)
```

Please refer to [Video-CCAM](https://github.com/QQ-MM/Video-CCAM) for more details.
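
The `load_decord` call above asks for 32 frames with `sample_type='uniform'`. As a rough illustration of what uniform sampling could mean, the sketch below picks the center frame of each of `num_frames` equal-length segments. The function name `uniform_frame_indices` is hypothetical and this is an assumption about the behavior, not the repository's actual implementation.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list:
    """Return evenly spaced frame indices (hypothetical helper, not from Video-CCAM).

    Assumption: "uniform" means splitting the video into `num_frames`
    equal segments and taking the frame at the center of each segment.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError('total_frames and num_frames must be positive')
    # Short videos: use every frame rather than repeating indices.
    if num_frames >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]


# A 320-frame video sampled at 32 frames yields indices 5, 15, ..., 315.
indices = uniform_frame_indices(320, 32)
print(indices[:3], indices[-1])
```

With decord, such indices could then be passed to `decord.VideoReader(path).get_batch(indices)` to decode only the selected frames. The real `load_decord` may handle edge cases differently.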

## Benchmarks

|Benchmark|Video-CCAM-14B|Video-CCAM-14B-v1.1|
|:-:|:-:|:-:|
|MVBench (32 frames)|61.88|63.08|
|MSVD-QA (32 frames)|76.3/4.1|78.6/4.2|
|MSRVTT-QA (32 frames)|59.0/3.5|66.3/3.8|
|ActivityNet-QA (32 frames)|58.3/3.7|60.4/3.8|
|TGIF-QA (32 frames)|84.1/4.5|84.4/4.5|
|Video-MME (w/o sub, 96 frames)|53.2|53.9|
|Video-MME (w sub, 96 frames)|57.2|56.1|
|MLVU (M-Avg, 96 frames)|60.2|63.1|
|MLVU (G-Avg, 96 frames)|4.11|4.01|
|VideoVista (96 frames)|68.43|76.55|

* For MSVD-QA, MSRVTT-QA, ActivityNet-QA, and TGIF-QA, results are reported as accuracy/score, both evaluated by `gpt-3.5-turbo-0125`.
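
To make the paired accuracy and score entries (for example `78.6/4.2`) concrete, here is a minimal sketch of how per-question judgments might be aggregated into such a pair. It assumes the judge model returns a yes/no verdict and an integer score from 0 to 5 for each question. The dictionary keys `pred` and `score` and the function name are hypothetical, and the exact evaluation protocol used for this table is not specified here.

```python
def aggregate_judgments(judgments):
    """Aggregate judge outputs into (accuracy in percent, mean score).

    Assumption: each judgment is a dict with a 'pred' field ('yes' or 'no')
    and a numeric 'score' field. These field names are illustrative only.
    """
    if not judgments:
        raise ValueError('judgments must not be empty')
    correct = sum(1 for j in judgments if j['pred'].strip().lower() == 'yes')
    accuracy = 100.0 * correct / len(judgments)
    mean_score = sum(j['score'] for j in judgments) / len(judgments)
    return accuracy, mean_score


# Toy data, not real benchmark output.
toy = [
    {'pred': 'yes', 'score': 5},
    {'pred': 'no', 'score': 2},
    {'pred': 'yes', 'score': 4},
    {'pred': 'Yes', 'score': 5},
]
accuracy, mean_score = aggregate_judgments(toy)
print(f'{accuracy:.1f}/{mean_score:.1f}')
```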

## Acknowledgement

* [xtuner](https://github.com/InternLM/xtuner): Video-CCAM-14B is trained using the xtuner framework. Thanks for their excellent work!
* [Phi-3-medium-4k-instruct](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct): A powerful language model developed by Microsoft.
* [SigLIP SO400M](https://huggingface.co/google/siglip-so400m-patch14-384): An outstanding vision encoder developed by Google.

## License
The model is licensed under the MIT license.