Several questions on the same video
Hello,
How can I ask several questions about the same video?
I tried this:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Describe this video"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "How many people are in the video?"}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": 'Is there any violence in the video? (answer "yes" or "no")'}
        ]
    }
]
But I get an error:
(smolvlm2-env) F:\CODE\smolvlm2>python smol4.py
Loading checkpoint shards: 100%|██████████████████████████████████████| 2/2 [00:03<00:00, 1.54s/it]
----- Start of iteration 1 -----
Message: {'role': 'user', 'content': [{'type': 'video', 'path': 'P:\1.mp4'}, {'type': 'text', 'text': 'Describe this video'}]}
Error: string indices must be integers
(it works with a single question)
Thanks in advance
Hi @delphijb ,
I have been looking into how this can be supported. For batch inference, you can do the following (UPDATE: this is already supported; you just need to add padding=True):
import torch

# Assumes `processor` and `model` have already been loaded.
video_path = "example.mp4"

conversation1 = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Describe this video"}
        ]
    }
]

conversation2 = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "Summarize this video"}
        ]
    }
]

conversations = [conversation1, conversation2]

# padding=True pads the shorter prompt so the batch can be stacked into one tensor.
batched_inputs = processor.apply_chat_template(
    conversations,
    add_generation_prompt=True,
    tokenize=True,
    padding=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

torch.cuda.reset_peak_memory_stats()
generated_ids = model.generate(**batched_inputs, do_sample=True, max_new_tokens=1024)
peak_mem = torch.cuda.max_memory_allocated()

generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0], '\n\n', generated_texts[1])
print(f"Peak GPU memory: {peak_mem / 1024**3:.2f} GB")
The main issue with asking multiple questions about one video is that the video is part of the user prompt. What you want is to swap out the text part of the prompt while keeping the video part the same, and the prompt format does not support that. So batch inference gives you essentially the same thing: the video has to be duplicated into each conversation either way.
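For instance, here is a minimal sketch of that duplication, assuming the same processor and model as above (the questions list is just illustrative): one conversation per question, all pointing at the same video, batched together.

questions = [
    "Describe this video",
    "How many people are in the video?",
    'Is there any violence in the video? (answer "yes" or "no")'
]

# One conversation per question, each referencing the same video file.
conversations = [
    [{
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": q}
        ]
    }]
    for q in questions
]

batched_inputs = processor.apply_chat_template(
    conversations,
    add_generation_prompt=True,
    tokenize=True,
    padding=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**batched_inputs, do_sample=True, max_new_tokens=1024)
answers = processor.batch_decode(generated_ids, skip_special_tokens=True)
for q, a in zip(questions, answers):
    print(q, "->", a, "\n")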
You could also use the model's multi-turn conversation capability, but that runs sequentially, not in parallel.
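A rough sketch of that multi-turn approach (again assuming the processor and model from above; I have not verified the memory behavior): attach the video only on the first user turn, then feed each answer back as an assistant turn before asking the next question.

conversation = []
questions = ["Describe this video", "How many people are in the video?"]

for q in questions:
    content = [{"type": "text", "text": q}]
    if not conversation:
        # Attach the video only on the first turn.
        content.insert(0, {"type": "video", "path": video_path})
    conversation.append({"role": "user", "content": content})

    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device, dtype=torch.bfloat16)

    generated_ids = model.generate(**inputs, do_sample=True, max_new_tokens=512)
    # Keep only the newly generated tokens, not the echoed prompt.
    answer = processor.batch_decode(
        generated_ids[:, inputs["input_ids"].shape[1]:],
        skip_special_tokens=True
    )[0]
    print(q, "->", answer)

    # Feed the answer back in so the next question has the full context.
    conversation.append({"role": "assistant",
                         "content": [{"type": "text", "text": answer}]})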