Several questions on the same video

#8
by delphijb - opened

Hello,
How can I ask several questions on the same video?
I tried this:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Describe this video"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "How many people are in the video?"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Is there any violence in the video? (answer 'yes' or 'no')"}
        ]
    }
]

But I get this error:

(smolvlm2-env) F:\CODE\smolvlm2>python smol4.py
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.54s/it]
----- Start of iteration 1 -----
Message: {'role': 'user', 'content': [{'type': 'video', 'path': 'P:\1.mp4'}, {'type': 'text', 'text': 'Describe this video'}]}
Error: string indices must be integers

(it works with a single question)

Thanks in advance

Hugging Face TB Research org

Hi @delphijb,

I have been looking into how this can be supported. For batch inference, you can do the following (UPDATE: this is already supported; you just need to add padding=True):

import torch

# `model` and `processor` are assumed to be loaded already, e.g. with
# AutoModelForImageTextToText.from_pretrained(...) and AutoProcessor.from_pretrained(...).
video_path = "example.mp4"
conversation1 = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Describe this video"}
        ]
    }
]

conversation2 = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Summarize this video"}
        ]
    }
]

conversations = [conversation1, conversation2]

batched_inputs = processor.apply_chat_template(
    conversations,
    add_generation_prompt=True,
    tokenize=True,
    padding=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

torch.cuda.reset_peak_memory_stats()
generated_ids = model.generate(**batched_inputs, do_sample=True, max_new_tokens=1024)
peak_mem = torch.cuda.max_memory_allocated()

generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0], '\n\n', generated_texts[1])

print(f"Peak GPU memory: {peak_mem / 1024**3:.2f} GB")

The main issue with asking multiple questions about one video is that the video is part of the user prompt. You would essentially need to swap out part of the user prompt while keeping the rest the same, which is not supported. In practice, batch inference gives you the same result anyway, since the image/video has to be duplicated for each question either way; a sketch of this pattern follows below.
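For example, a minimal sketch of several questions over the same video (reusing the model and processor from above; the question strings are just placeholders):

questions = [
    "Describe this video",
    "How many people are in the video?",
    "Is there any violence in the video? (answer 'yes' or 'no')",
]

# One single-turn conversation per question, all referencing the same video.
conversations = [
    [
        {
            "role": "user",
            "content": [
                {"type": "video", "path": video_path},
                {"type": "text", "text": question},
            ],
        }
    ]
    for question in questions
]

batched_inputs = processor.apply_chat_template(
    conversations,
    add_generation_prompt=True,
    tokenize=True,
    padding=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**batched_inputs, do_sample=True, max_new_tokens=1024)
for text in processor.batch_decode(generated_ids, skip_special_tokens=True):
    print(text, "\n")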

You could also use the model's multi-turn conversation capability, but that is sequential, not parallel; a rough sketch follows.
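Here is what that sequential approach could look like (assumed usage reusing the same model and processor, not an official recipe):

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Describe this video"},
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
generated_ids = model.generate(**inputs, do_sample=True, max_new_tokens=256)
# batch_decode returns the full text (prompt + reply); in practice you would
# trim it down to just the assistant's answer before feeding it back.
answer = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Append the answer and the follow-up question, then re-apply the chat
# template on the extended conversation and generate again.
conversation.append({"role": "assistant", "content": [{"type": "text", "text": answer}]})
conversation.append({
    "role": "user",
    "content": [{"type": "text", "text": "How many people are in the video?"}],
})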
